I tried to write a UDF to calculate my data. From the Trino docs I learned that I should write a function plugin, and I successfully executed SQL using my UDF as an aggregate function.
But when I wrote SQL that combines the aggregate function with a window function, the query failed.
The error log is com.google.common.util.concurrent.ExecutionError: java.lang.NoClassDefFoundError: com/example/ListState.
I think I may need to implement some additional interface related to window functions.
The ListState.java code:
@AccumulatorStateMetadata(stateSerializerClass = ListStateSerializer.class, stateFactoryClass = ListStateFactory.class)
public interface ListState extends AccumulatorState {
List<String> getList();
void setList(List<String> value);
}
The ListStateSerializer code:
public class ListStateSerializer implements AccumulatorStateSerializer<ListState>
{
@Override
public Type getSerializedType() {
return VARCHAR;
}
@Override
public void serialize(ListState state, BlockBuilder out) {
if (state.getList() == null) {
out.appendNull();
return;
}
String value = String.join(",", state.getList());
VARCHAR.writeSlice(out, Slices.utf8Slice(value));
}
@Override
public void deserialize(Block block, int index, ListState state) {
String value = VARCHAR.getSlice(block, index).toStringUtf8();
List<String> list = Arrays.asList(value.split(","));
state.setList(list);
}
}
The ListStateFactory code:
public class ListStateFactory implements AccumulatorStateFactory<ListState> {
public static final class SingleListState implements ListState {
private List<String> list = new ArrayList<>();
@Override
public List<String> getList() {
return list;
}
@Override
public void setList(List<String> value) {
list = value;
}
@Override
public long getEstimatedSize() {
if (list == null) {
return 0;
}
return list.size();
}
}
public static class GroupedListState implements GroupedAccumulatorState, ListState {
private final ObjectBigArray<List<String>> container = new ObjectBigArray<>();
private long groupId;
@Override
public List<String> getList() {
return container.get(groupId);
}
@Override
public void setList(List<String> value) {
container.set(groupId, value);
}
@Override
public void setGroupId(long groupId) {
this.groupId = groupId;
if (this.getList() == null) {
this.setList(new ArrayList<String>());
}
}
@Override
public void ensureCapacity(long size) {
container.ensureCapacity(size);
}
@Override
public long getEstimatedSize() {
return container.sizeOf();
}
}
@Override
public ListState createSingleState() {
return new SingleListState();
}
@Override
public ListState createGroupedState() {
return new GroupedListState();
}
}
Thanks for any help!
I also found the WindowAccumulator class in the Trino source code, but I don't know how to use it.
How do I create an aggregate function that also works as a window function?
I'm completely new to the Hadoop framework and I want to write a MapReduce program (HadoopJoin.java) that joins the two tables R and S on the x attribute. The structure of the two tables is:
R (tag : char, x : int, y : varchar(30))
and
S (tag : char, x : int, z : varchar(30))
For example we have for R table :
r 10 r-10-0
r 11 r-11-0
r 12 r-12-0
r 21 r-21-0
And for S table :
s 11 s-11-0
s 21 s-41-0
s 21 s-41-1
s 12 s-31-0
s 11 s-31-1
The result should look like :
r 11 r-11-0 s 11 s-11-0
etc.
Can anyone help me, please?
It will be very difficult to describe a join in MapReduce to someone who is new to this framework, but here I provide a working implementation for your situation, and I definitely recommend you read section 9 of Hadoop: The Definitive Guide, 4th Edition. It describes how to implement a join in MapReduce very well.
First of all, you might consider using higher-level frameworks such as Pig, Hive, or Spark, because they provide the join operation as a core part of their implementation.
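For illustration, the same equi-join is only a few lines in Spark's Java Dataset API; this is a minimal sketch, assuming space-separated input files, with hypothetical paths and column names:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkJoinSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("JoinRS").getOrCreate();
        // Read R(tag, x, y) and S(tag, x, z); the paths and column names are made up for the example.
        Dataset<Row> r = spark.read().option("sep", " ").csv("hdfs:///data/R.txt").toDF("r_tag", "x", "y");
        Dataset<Row> s = spark.read().option("sep", " ").csv("hdfs:///data/S.txt").toDF("s_tag", "s_x", "z");
        // Inner join on the x attribute.
        Dataset<Row> joined = r.join(s, r.col("x").equalTo(s.col("s_x")));
        joined.show(false);
        spark.stop();
    }
}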
Secondly, there are many ways to implement a join in MapReduce, depending on the nature of your data. These include the map-side join and the reduce-side join. In this answer I have implemented a reduce-side join:
Implementation:
First of all, we need two different mappers for the two different datasets. Note that in your case the same mapper could be used for both datasets, but in many situations you need different mappers for different datasets, so I have defined two mappers to make this solution more general.
I have used a TextPair key that has two attributes: one is the key used to join the data, and the other is a tag that specifies which dataset the record belongs to. If the record belongs to the first dataset the tag is 0; otherwise it is 1.
I have implemented TextPair.FirstComparator to ensure that, for each join key, the record from the first dataset is the first one the reducer receives, and all the records from the second dataset with that key arrive after it. This line of code does the trick for us:
job.setGroupingComparatorClass(TextPair.FirstComparator.class);
So in the reducer, the first record we receive is the record from dataset1, and after that we receive the records from dataset2. The only thing left to do is to write those records out.
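For example, with your sample data the reducer call for join key 11 receives the dataset1 record first and then the matching dataset2 records, and emits one joined line per dataset2 record (an illustration of the flow, not captured program output):

key:    (11, 0)
values: r 11 r-11-0, s 11 s-11-0, s 11 s-31-1
output: r 11 r-11-0    s 11 s-11-0
        r 11 r-11-0    s 11 s-31-1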
Mapper for dataset1:
public class JoinDataSet1Mapper
extends Mapper<LongWritable, Text, TextPair, Text> {
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String[] data = value.toString().split(" ");
context.write(new TextPair(data[1], "0"), value);
}
}
Mapper for DataSet2:
public class JoinDataSet2Mapper
extends Mapper<LongWritable, Text, TextPair, Text> {
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String[] data = value.toString().split(" ");
context.write(new TextPair(data[1], "1"), value);
}
}
Reducer:
public class JoinReducer extends Reducer<TextPair, Text, NullWritable, Text> {
public static class KeyPartitioner extends Partitioner<TextPair, Text> {
@Override
public int getPartition(TextPair key, Text value, int numPartitions) {
return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
}
}
@Override
protected void reduce(TextPair key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
Iterator<Text> iter = values.iterator();
Text stationName = new Text(iter.next());
while (iter.hasNext()) {
Text record = iter.next();
Text outValue = new Text(stationName.toString() + "\t" + record.toString());
context.write(NullWritable.get(), outValue);
}
}
}
Custom key:
public class TextPair implements WritableComparable<TextPair> {
private Text first;
private Text second;
public TextPair() {
set(new Text(), new Text());
}
public TextPair(String first, String second) {
set(new Text(first), new Text(second));
}
public TextPair(Text first, Text second) {
set(first, second);
}
public void set(Text first, Text second) {
this.first = first;
this.second = second;
}
public Text getFirst() {
return first;
}
public Text getSecond() {
return second;
}
@Override
public void write(DataOutput out) throws IOException {
first.write(out);
second.write(out);
}
@Override
public void readFields(DataInput in) throws IOException {
first.readFields(in);
second.readFields(in);
}
@Override
public int hashCode() {
return first.hashCode() * 163 + second.hashCode();
}
@Override
public boolean equals(Object o) {
if (o instanceof TextPair) {
TextPair tp = (TextPair) o;
return first.equals(tp.first) && second.equals(tp.second);
}
return false;
}
@Override
public String toString() {
return first + "\t" + second;
}
@Override
public int compareTo(TextPair tp) {
int cmp = first.compareTo(tp.first);
if (cmp != 0) {
return cmp;
}
return second.compareTo(tp.second);
}
public static class FirstComparator extends WritableComparator {
private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
public FirstComparator() {
super(TextPair.class);
}
@Override
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2) {
try {
int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
return TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
} catch (IOException e) {
throw new IllegalArgumentException(e);
}
}
@Override
public int compare(WritableComparable a, WritableComparable b) {
if (a instanceof TextPair && b instanceof TextPair) {
return ((TextPair) a).first.compareTo(((TextPair) b).first);
}
return super.compare(a, b);
}
}
}
JobDriver:
public class JoinJob extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
Job job = Job.getInstance(getConf(), "Join two DataSet");
job.setJarByClass(getClass());
Path ncdcInputPath = new Path(getConf().get("job.input1.path"));
Path stationInputPath = new Path(getConf().get("job.input2.path"));
Path outputPath = new Path(getConf().get("job.output.path"));
MultipleInputs.addInputPath(job, ncdcInputPath,
TextInputFormat.class, JoinDataSet1Mapper.class);
MultipleInputs.addInputPath(job, stationInputPath,
TextInputFormat.class, JoinDataSet2Mapper.class);
FileOutputFormat.setOutputPath(job, outputPath);
job.setPartitionerClass(JoinReducer.KeyPartitioner.class);
job.setGroupingComparatorClass(TextPair.FirstComparator.class);
job.setMapOutputKeyClass(TextPair.class);
job.setReducerClass(JoinReducer.class);
job.setOutputKeyClass(Text.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new JoinJob(), args);
System.exit(exitCode);
}
}
No matter how simple I make the compareTo of my complex key, I don't get the expected results. The one exception is that if I use a single key that is the same for every record, everything appropriately reduces to one record. I've also noticed that this happens only when I process the full load; if I break off a few of the records that didn't reduce and run them at a much smaller scale, those records do get combined.
The sum of the output records is correct, but there is duplication at the record level of items I would have expected to group together. So where I would expect, say, 500 records summing to 5,000, I end up with 1,232 records summing to 5,000, with obvious records that should have been reduced into one.
I've read about the problems with object references and complex keys and values, but I don't see anywhere that I still have the potential for that. To that end, you will find places where I'm creating new objects that I probably don't need to, but I'm trying everything at this point and will dial it back once it is working.
I'm out of ideas on what to try or where and how to poke to figure this out. Please help!
public static class Map extends
Mapper<LongWritable, Text, IMSTranOut, IMSTranSums> {
//private SimpleDateFormat dtFormat = new SimpleDateFormat("yyyyddd");
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
SimpleDateFormat dtFormat = new SimpleDateFormat("yyyyddd");
IMSTranOut dbKey = new IMSTranOut();
IMSTranSums sumVals = new IMSTranSums();
String[] tokens = line.split(",", -1);
dbKey.setLoadKey(-99);
dbKey.setTranClassKey(-99);
dbKey.setTransactionCode(tokens[0]);
dbKey.setTransactionType(tokens[1]);
dbKey.setNpaNxx(getNPA(dbKey.getTransactionCode()));
try {
dbKey.setTranDate(new Date(dtFormat.parse(tokens[2]).getTime()));
} catch (ParseException e) {
}// 2
dbKey.setTranHour(getTranHour(tokens[3]));
try {
dbKey.setStartDate(new Date(dtFormat.parse(tokens[4]).getTime()));
} catch (ParseException e) {
}// 4
dbKey.setStartHour(getTranHour(tokens[5]));
try {
dbKey.setStopDate(new Date(dtFormat.parse(tokens[6]).getTime()));
} catch (ParseException e) {
}// 6
dbKey.setStopHour(getTranHour(tokens[7]));
sumVals.setTranCount(1);
sumVals.setInputQTime(Double.parseDouble(tokens[8]));
sumVals.setElapsedTime(Double.parseDouble(tokens[9]));
sumVals.setCpuTime(Double.parseDouble(tokens[10]));
context.write(dbKey, sumVals);
}
}
public static class Reduce extends
Reducer<IMSTranOut, IMSTranSums, IMSTranOut, IMSTranSums> {
@Override
public void reduce(IMSTranOut key, Iterable<IMSTranSums> values,
Context context) throws IOException, InterruptedException {
int tranCount = 0;
double inputQ = 0;
double elapsed = 0;
double cpu = 0;
for (IMSTranSums val : values) {
tranCount += val.getTranCount();
inputQ += val.getInputQTime();
elapsed += val.getElapsedTime();
cpu += val.getCpuTime();
}
IMSTranSums sumVals=new IMSTranSums();
IMSTranOut dbKey=new IMSTranOut();
sumVals.setCpuTime(inputQ);
sumVals.setElapsedTime(elapsed);
sumVals.setInputQTime(cpu);
sumVals.setTranCount(tranCount);
dbKey.setLoadKey(key.getLoadKey());
dbKey.setTranClassKey(key.getTranClassKey());
dbKey.setNpaNxx(key.getNpaNxx());
dbKey.setTransactionCode(key.getTransactionCode());
dbKey.setTransactionType(key.getTransactionType());
dbKey.setTranDate(key.getTranDate());
dbKey.setTranHour(key.getTranHour());
dbKey.setStartDate(key.getStartDate());
dbKey.setStartHour(key.getStartHour());
dbKey.setStopDate(key.getStopDate());
dbKey.setStopHour(key.getStopHour());
dbKey.setInputQTime(inputQ);
dbKey.setElapsedTime(elapsed);
dbKey.setCpuTime(cpu);
dbKey.setTranCount(tranCount);
context.write(dbKey, sumVals);
}
}
Here is the implementation of the DBWritable class:
public class IMSTranOut implements DBWritable,
WritableComparable<IMSTranOut> {
private int loadKey;
private int tranClassKey;
private String npaNxx;
private String transactionCode;
private String transactionType;
private Date tranDate;
private double tranHour;
private Date startDate;
private double startHour;
private Date stopDate;
private double stopHour;
private double inputQTime;
private double elapsedTime;
private double cpuTime;
private int tranCount;
public void readFields(ResultSet rs) throws SQLException {
setLoadKey(rs.getInt("LOAD_KEY"));
setTranClassKey(rs.getInt("TRAN_CLASS_KEY"));
setNpaNxx(rs.getString("NPA_NXX"));
setTransactionCode(rs.getString("TRANSACTION_CODE"));
setTransactionType(rs.getString("TRANSACTION_TYPE"));
setTranDate(rs.getDate("TRAN_DATE"));
setTranHour(rs.getInt("TRAN_HOUR"));
setStartDate(rs.getDate("START_DATE"));
setStartHour(rs.getInt("START_HOUR"));
setStopDate(rs.getDate("STOP_DATE"));
setStopHour(rs.getInt("STOP_HOUR"));
setInputQTime(rs.getInt("INPUT_Q_TIME"));
setElapsedTime(rs.getInt("ELAPSED_TIME"));
setCpuTime(rs.getInt("CPU_TIME"));
setTranCount(rs.getInt("TRAN_COUNT"));
}
public void write(PreparedStatement ps) throws SQLException {
ps.setInt(1, loadKey);
ps.setInt(2, tranClassKey);
ps.setString(3, npaNxx);
ps.setString(4, transactionCode);
ps.setString(5, transactionType);
ps.setDate(6, tranDate);
ps.setDouble(7, tranHour);
ps.setDate(8, startDate);
ps.setDouble(9, startHour);
ps.setDate(10, stopDate);
ps.setDouble(11, stopHour);
ps.setDouble(12, inputQTime);
ps.setDouble(13, elapsedTime);
ps.setDouble(14, cpuTime);
ps.setInt(15, tranCount);
}
public int getLoadKey() {
return loadKey;
}
public void setLoadKey(int loadKey) {
this.loadKey = loadKey;
}
public int getTranClassKey() {
return tranClassKey;
}
public void setTranClassKey(int tranClassKey) {
this.tranClassKey = tranClassKey;
}
public String getNpaNxx() {
return npaNxx;
}
public void setNpaNxx(String npaNxx) {
this.npaNxx = new String(npaNxx);
}
public String getTransactionCode() {
return transactionCode;
}
public void setTransactionCode(String transactionCode) {
this.transactionCode = new String(transactionCode);
}
public String getTransactionType() {
return transactionType;
}
public void setTransactionType(String transactionType) {
this.transactionType = new String(transactionType);
}
public Date getTranDate() {
return tranDate;
}
public void setTranDate(Date tranDate) {
this.tranDate = new Date(tranDate.getTime());
}
public double getTranHour() {
return tranHour;
}
public void setTranHour(double tranHour) {
this.tranHour = tranHour;
}
public Date getStartDate() {
return startDate;
}
public void setStartDate(Date startDate) {
this.startDate = new Date(startDate.getTime());
}
public double getStartHour() {
return startHour;
}
public void setStartHour(double startHour) {
this.startHour = startHour;
}
public Date getStopDate() {
return stopDate;
}
public void setStopDate(Date stopDate) {
this.stopDate = new Date(stopDate.getTime());
}
public double getStopHour() {
return stopHour;
}
public void setStopHour(double stopHour) {
this.stopHour = stopHour;
}
public double getInputQTime() {
return inputQTime;
}
public void setInputQTime(double inputQTime) {
this.inputQTime = inputQTime;
}
public double getElapsedTime() {
return elapsedTime;
}
public void setElapsedTime(double elapsedTime) {
this.elapsedTime = elapsedTime;
}
public double getCpuTime() {
return cpuTime;
}
public void setCpuTime(double cpuTime) {
this.cpuTime = cpuTime;
}
public int getTranCount() {
return tranCount;
}
public void setTranCount(int tranCount) {
this.tranCount = tranCount;
}
public void readFields(DataInput input) throws IOException {
setNpaNxx(input.readUTF());
setTransactionCode(input.readUTF());
setTransactionType(input.readUTF());
setTranDate(new Date(input.readLong()));
setStartDate(new Date(input.readLong()));
setStopDate(new Date(input.readLong()));
setLoadKey(input.readInt());
setTranClassKey(input.readInt());
setTranHour(input.readDouble());
setStartHour(input.readDouble());
setStopHour(input.readDouble());
setInputQTime(input.readDouble());
setElapsedTime(input.readDouble());
setCpuTime(input.readDouble());
setTranCount(input.readInt());
}
public void write(DataOutput output) throws IOException {
output.writeUTF(npaNxx);
output.writeUTF(transactionCode);
output.writeUTF(transactionType);
output.writeLong(tranDate.getTime());
output.writeLong(startDate.getTime());
output.writeLong(stopDate.getTime());
output.writeInt(loadKey);
output.writeInt(tranClassKey);
output.writeDouble(tranHour);
output.writeDouble(startHour);
output.writeDouble(stopHour);
output.writeDouble(inputQTime);
output.writeDouble(elapsedTime);
output.writeDouble(cpuTime);
output.writeInt(tranCount);
}
public int compareTo(IMSTranOut o) {
return (Integer.compare(loadKey, o.getLoadKey()) == 0
&& Integer.compare(tranClassKey, o.getTranClassKey()) == 0
&& npaNxx.compareTo(o.getNpaNxx()) == 0
&& transactionCode.compareTo(o.getTransactionCode()) == 0
&& (transactionType.compareTo(o.getTransactionType()) == 0)
&& tranDate.compareTo(o.getTranDate()) == 0
&& Double.compare(tranHour, o.getTranHour()) == 0
&& startDate.compareTo(o.getStartDate()) == 0
&& Double.compare(startHour, o.getStartHour()) == 0
&& stopDate.compareTo(o.getStopDate()) == 0
&& Double.compare(stopHour, o.getStopHour()) == 0) ? 0 : 1;
}
}
Implementation of the Writable class for the complex values:
public class IMSTranSums
implements Writable{
private double inputQTime;
private double elapsedTime;
private double cpuTime;
private int tranCount;
public double getInputQTime() {
return inputQTime;
}
public void setInputQTime(double inputQTime) {
this.inputQTime = inputQTime;
}
public double getElapsedTime() {
return elapsedTime;
}
public void setElapsedTime(double elapsedTime) {
this.elapsedTime = elapsedTime;
}
public double getCpuTime() {
return cpuTime;
}
public void setCpuTime(double cpuTime) {
this.cpuTime = cpuTime;
}
public int getTranCount() {
return tranCount;
}
public void setTranCount(int tranCount) {
this.tranCount = tranCount;
}
public void write(DataOutput output) throws IOException {
output.writeDouble(inputQTime);
output.writeDouble(elapsedTime);
output.writeDouble(cpuTime);
output.writeInt(tranCount);
}
public void readFields(DataInput input) throws IOException {
inputQTime=input.readDouble();
elapsedTime=input.readDouble();
cpuTime=input.readDouble();
tranCount=input.readInt();
}
}
Your compareTo is flawed: it will completely break the sort algorithm, because you appear to violate transitivity in your ordering.
I would recommend using a CompareToBuilder from Apache Commons or a ComparisonChain from Guava to make your comparisons much more readable (and correct!).
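For example, here is a minimal sketch of a transitive compareTo for your IMSTranOut key using Guava's ComparisonChain (import com.google.common.collect.ComparisonChain); the chain short-circuits on the first field that differs, so the resulting order is consistent:

@Override
public int compareTo(IMSTranOut o) {
    // Compare the same fields you intend to group on, in a fixed order.
    return ComparisonChain.start()
            .compare(loadKey, o.getLoadKey())
            .compare(tranClassKey, o.getTranClassKey())
            .compare(npaNxx, o.getNpaNxx())
            .compare(transactionCode, o.getTransactionCode())
            .compare(transactionType, o.getTransactionType())
            .compare(tranDate, o.getTranDate())
            .compare(tranHour, o.getTranHour())
            .compare(startDate, o.getStartDate())
            .compare(startHour, o.getStartHour())
            .compare(stopDate, o.getStopDate())
            .compare(stopHour, o.getStopHour())
            .result();
}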
I am new to Hadoop and Java, and I feel there is something obvious I am just missing. I am using Hadoop 1.0.3 if that means anything.
My goal in using Hadoop is to take a bunch of files and parse them one file at a time (as opposed to line by line). Each file will produce multiple key-value pairs, but the context of the other lines is important. The key and value are multi-value/composite, so I have implemented WritableComparable for the key and Writable for the value. Because the processing of each file takes a bit of CPU, I want to save the output of the mapper, then run multiple reducers later on.
For the composite keys, I followed http://stackoverflow.com/questions/12427090/hadoop-composite-key
The problem is, the output is just Java object references as opposed to the composite key and value. Example:
LinkKeyWritable#bd2f9730 LinkValueWritable#8752408c
I am not sure if the problem is related to the data not being reduced at all, or something else.
Here is my main class:
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(Parser.class);
conf.setJobName("raw_parser");
conf.setOutputKeyClass(LinkKeyWritable.class);
conf.setOutputValueClass(LinkValueWritable.class);
conf.setMapperClass(RawMap.class);
conf.setNumMapTasks(0);
conf.setInputFormat(PerFileInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
PerFileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
And my Mapper class:
public class RawMap extends MapReduceBase implements
Mapper {
public void map(NullWritable key, Text value,
OutputCollector<LinkKeyWritable, LinkValueWritable> output,
Reporter reporter) throws IOException {
String json = value.toString();
SerpyReader reader = new SerpyReader(json);
GoogleParser parser = new GoogleParser(reader);
for (String page : reader.getPages()) {
String content = reader.readPageContent(page);
parser.addPage(content);
}
for (Link link : parser.getLinks()) {
LinkKeyWritable linkKey = new LinkKeyWritable(link);
LinkValueWritable linkValue = new LinkValueWritable(link);
output.collect(linkKey, linkValue);
}
}
}
Link is basically a struct of various pieces of information that get split between LinkKeyWritable and LinkValueWritable.
LinkKeyWritable:
public class LinkKeyWritable implements WritableComparable<LinkKeyWritable>{
protected Link link;
public LinkKeyWritable() {
super();
link = new Link();
}
public LinkKeyWritable(Link link) {
super();
this.link = link;
}
@Override
public void readFields(DataInput in) throws IOException {
link.batchDay = in.readLong();
link.source = in.readUTF();
link.domain = in.readUTF();
link.path = in.readUTF();
}
@Override
public void write(DataOutput out) throws IOException {
out.writeLong(link.batchDay);
out.writeUTF(link.source);
out.writeUTF(link.domain);
out.writeUTF(link.path);
}
@Override
public int compareTo(LinkKeyWritable o) {
return ComparisonChain.start().
compare(link.batchDay, o.link.batchDay).
compare(link.domain, o.link.domain).
compare(link.path, o.link.path).
result();
}
@Override
public int hashCode() {
return Objects.hashCode(link.batchDay, link.source, link.domain, link.path);
}
@Override
public boolean equals(final Object obj){
if(obj instanceof LinkKeyWritable) {
final LinkKeyWritable o = (LinkKeyWritable)obj;
return Objects.equal(link.batchDay, o.link.batchDay)
&& Objects.equal(link.source, o.link.source)
&& Objects.equal(link.domain, o.link.domain)
&& Objects.equal(link.path, o.link.path);
}
return false;
}
}
LinkValueWritable:
public class LinkValueWritable implements Writable{
protected Link link;
public LinkValueWritable() {
link = new Link();
}
public LinkValueWritable(Link link) {
this.link = new Link();
this.link.type = link.type;
this.link.description = link.description;
}
@Override
public void readFields(DataInput in) throws IOException {
link.type = in.readUTF();
link.description = in.readUTF();
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(link.type);
out.writeUTF(link.description);
}
@Override
public int hashCode() {
return Objects.hashCode(link.type, link.description);
}
@Override
public boolean equals(final Object obj){
if(obj instanceof LinkKeyWritable) {
final LinkKeyWritable o = (LinkKeyWritable)obj;
return Objects.equal(link.type, o.link.type)
&& Objects.equal(link.description, o.link.description);
}
return false;
}
}
I think the answer is in the implementation of the TextOutputFormat. Specifically, the LineRecordWriter's writeObject method:
/**
* Write the object to the byte stream, handling Text as a special
* case.
* @param o the object to print
* @throws IOException if the write throws, we pass it on
*/
private void writeObject(Object o) throws IOException {
if (o instanceof Text) {
Text to = (Text) o;
out.write(to.getBytes(), 0, to.getLength());
} else {
out.write(o.toString().getBytes(utf8));
}
}
As you can see, if your key or value is not a Text object, it calls the toString method on it and writes that out. Since you've left toString unimplemented in your key and value, it's using the Object class's implementation, which is writing out the reference.
I'd say that you should try writing an appropriate toString function or using a different OutputFormat.
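For example, a minimal toString sketch for LinkKeyWritable built from the Link fields you already serialize (the same idea applies to LinkValueWritable):

@Override
public String toString() {
    // Tab-separated fields so TextOutputFormat prints readable columns instead of an object reference.
    return link.batchDay + "\t" + link.source + "\t" + link.domain + "\t" + link.path;
}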
It looks like you have a list of objects, just like you wanted. You need to implement toString() on your Writable if you want a human-readable version printed out instead of an ugly Java reference.
I have this Hadoop MapReduce code that works on graph data (in adjacency-list form) and is somewhat similar to an in-adjacency-list to out-adjacency-list transformation algorithm. The main MapReduce task code is the following:
public class TestTask extends Configured
implements Tool {
public static class TTMapper extends MapReduceBase
implements Mapper<Text, TextArrayWritable, Text, NeighborWritable> {
@Override
public void map(Text key,
TextArrayWritable value,
OutputCollector<Text, NeighborWritable> output,
Reporter reporter) throws IOException {
int numNeighbors = value.get().length;
double weight = (double)1 / numNeighbors;
Text[] neighbors = (Text[]) value.toArray();
NeighborWritable me = new NeighborWritable(key, new DoubleWritable(weight));
for (int i = 0; i < neighbors.length; i++) {
output.collect(neighbors[i], me);
}
}
}
public static class TTReducer extends MapReduceBase
implements Reducer<Text, NeighborWritable, Text, Text> {
@Override
public void reduce(Text key,
Iterator<NeighborWritable> values,
OutputCollector<Text, Text> output,
Reporter arg3)
throws IOException {
ArrayList<NeighborWritable> neighborList = new ArrayList<NeighborWritable>();
while(values.hasNext()) {
neighborList.add(values.next());
}
NeighborArrayWritable neighbors = new NeighborArrayWritable
(neighborList.toArray(new NeighborWritable[0]));
Text out = new Text(neighbors.toString());
output.collect(key, out);
}
}
@Override
public int run(String[] arg0) throws Exception {
JobConf conf = Util.getMapRedJobConf("testJob",
SequenceFileInputFormat.class,
TTMapper.class,
Text.class,
NeighborWritable.class,
1,
TTReducer.class,
Text.class,
Text.class,
TextOutputFormat.class,
"test/in",
"test/out");
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new TestTask(), args);
System.exit(res);
}
}
The auxiliary code is the following:
TextArrayWritable:
public class TextArrayWritable extends ArrayWritable {
public TextArrayWritable() {
super(Text.class);
}
public TextArrayWritable(Text[] values) {
super(Text.class, values);
}
}
NeighborWritable:
public class NeighborWritable implements Writable {
private Text nodeId;
private DoubleWritable weight;
public NeighborWritable(Text nodeId, DoubleWritable weight) {
this.nodeId = nodeId;
this.weight = weight;
}
public NeighborWritable () { }
public Text getNodeId() {
return nodeId;
}
public DoubleWritable getWeight() {
return weight;
}
public void setNodeId(Text nodeId) {
this.nodeId = nodeId;
}
public void setWeight(DoubleWritable weight) {
this.weight = weight;
}
@Override
public void readFields(DataInput in) throws IOException {
nodeId = new Text();
nodeId.readFields(in);
weight = new DoubleWritable();
weight.readFields(in);
}
@Override
public void write(DataOutput out) throws IOException {
nodeId.write(out);
weight.write(out);
}
public String toString() {
return "NW[nodeId=" + (nodeId != null ? nodeId.toString() : "(null)") +
",weight=" + (weight != null ? weight.toString() : "(null)") + "]";
}
public boolean equals(Object o) {
if (!(o instanceof NeighborWritable)) {
return false;
}
NeighborWritable that = (NeighborWritable)o;
return (nodeId.equals(that.getNodeId()) && (weight.equals(that.getWeight())));
}
}
and the Util class:
public class Util {
public static JobConf getMapRedJobConf(String jobName,
Class<? extends InputFormat> inputFormatClass,
Class<? extends Mapper> mapperClass,
Class<?> mapOutputKeyClass,
Class<?> mapOutputValueClass,
int numReducer,
Class<? extends Reducer> reducerClass,
Class<?> outputKeyClass,
Class<?> outputValueClass,
Class<? extends OutputFormat> outputFormatClass,
String inputDir,
String outputDir) throws IOException {
JobConf conf = new JobConf();
if (jobName != null)
conf.setJobName(jobName);
conf.setInputFormat(inputFormatClass);
conf.setMapperClass(mapperClass);
if (numReducer == 0) {
conf.setNumReduceTasks(0);
conf.setOutputKeyClass(outputKeyClass);
conf.setOutputValueClass(outputValueClass);
conf.setOutputFormat(outputFormatClass);
} else {
// may set actual number of reducers
// conf.setNumReduceTasks(numReducer);
conf.setMapOutputKeyClass(mapOutputKeyClass);
conf.setMapOutputValueClass(mapOutputValueClass);
conf.setReducerClass(reducerClass);
conf.setOutputKeyClass(outputKeyClass);
conf.setOutputValueClass(outputValueClass);
conf.setOutputFormat(outputFormatClass);
}
// delete the existing target output folder
FileSystem fs = FileSystem.get(conf);
fs.delete(new Path(outputDir), true);
// specify input and output DIRECTORIES (not files)
FileInputFormat.addInputPath(conf, new Path(inputDir));
FileOutputFormat.setOutputPath(conf, new Path(outputDir));
return conf;
}
}
My input is the following graph (stored in binary format; here I give the text form):
1 2
2 1,3,5
3 2,4
4 3,5
5 2,4
According to the logic of the code, the output should be:
1 NWArray[size=1,{NW[nodeId=2,weight=0.3333333333333333],}]
2 NWArray[size=3,{NW[nodeId=5,weight=0.5],NW[nodeId=3,weight=0.5],NW[nodeId=1,weight=1.0],}]
3 NWArray[size=2,{NW[nodeId=2,weight=0.3333333333333333],NW[nodeId=4,weight=0.5],}]
4 NWArray[size=2,{NW[nodeId=5,weight=0.5],NW[nodeId=3,weight=0.5],}]
5 NWArray[size=2,{NW[nodeId=2,weight=0.3333333333333333],NW[nodeId=4,weight=0.5],}]
But the output comes out as:
1 NWArray[size=1,{NW[nodeId=2,weight=0.3333333333333333],}]
2 NWArray[size=3,{NW[nodeId=5,weight=0.5],NW[nodeId=5,weight=0.5],NW[nodeId=5,weight=0.5],}]
3 NWArray[size=2,{NW[nodeId=2,weight=0.3333333333333333],NW[nodeId=2,weight=0.3333333333333333],}]
4 NWArray[size=2,{NW[nodeId=5,weight=0.5],NW[nodeId=5,weight=0.5],}]
5 NWArray[size=2,{NW[nodeId=2,weight=0.3333333333333333],NW[nodeId=2,weight=0.3333333333333333],}]
I cannot understand why the expected output is not being produced. Any help will be appreciated.
Thanks.
You're falling foul of object re-use:
while(values.hasNext()) {
neighborList.add(values.next());
}
values.next() will return the same object reference, but the underlying contents of that object will change with each iteration (the readFields method is called to re-populate the contents).
I suggest you amend your code as follows (you'll need to obtain the Configuration conf variable from a setup method, unless you can obtain it from the Reporter or OutputCollector; sorry, I don't use the old API):
while (values.hasNext()) {
    neighborList.add(
        ReflectionUtils.copy(conf, values.next(), new NeighborWritable()));
}
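Putting it together, here is a minimal sketch of the amended reducer under the old API, assuming you capture the JobConf by overriding configure from MapReduceBase:

public static class TTReducer extends MapReduceBase
        implements Reducer<Text, NeighborWritable, Text, Text> {
    private JobConf conf;

    @Override
    public void configure(JobConf job) {
        // Keep a reference to the job configuration for ReflectionUtils.copy
        this.conf = job;
    }

    @Override
    public void reduce(Text key, Iterator<NeighborWritable> values,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        ArrayList<NeighborWritable> neighborList = new ArrayList<NeighborWritable>();
        while (values.hasNext()) {
            // Copy each value into a fresh object, because Hadoop reuses the instance returned by next()
            neighborList.add(ReflectionUtils.copy(conf, values.next(), new NeighborWritable()));
        }
        NeighborArrayWritable neighbors =
                new NeighborArrayWritable(neighborList.toArray(new NeighborWritable[0]));
        output.collect(key, new Text(neighbors.toString()));
    }
}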
But I still can't understand why my unit test passed, then. Here is the code:
public class UWLTInitReducerTest {
private Text key;
private Iterator<NeighborWritable> values;
private NeighborArrayWritable nodeData;
private TTReducer reducer;
/**
* Set up the states for calling the map function
*/
@Before
public void setUp() throws Exception {
key = new Text("1001");
NeighborWritable[] neighbors = new NeighborWritable[4];
for (int i = 0; i < 4; i++) {
neighbors[i] = new NeighborWritable(new Text("300" + i), new DoubleWritable((double) 1 / (1 + i)));
}
values = Arrays.asList(neighbors).iterator();
nodeData = new NeighborArrayWritable(neighbors);
reducer = new TTReducer();
}
/**
* Test method for InitModelMapper#map - valid input
*/
@Test
public void testMapValid() {
// mock the output object
OutputCollector<Text, UWLTNodeData> output = mock(OutputCollector.class);
try {
// call the API
reducer.reduce(key, values, output, null);
// in order (sequential) verification of the calls to output.collect()
verify(output).collect(key, nodeData);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
Why didn't this code catch the bug?
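If it helps, the test likely passes because its values iterator comes from a plain List, so each next() call returns a distinct NeighborWritable, whereas Hadoop's real value iterator keeps handing back the same reused instance. A hypothetical sketch of a reusing iterator for setUp that would reproduce the bug:

// Hypothetical helper for setUp: an iterator that reuses a single NeighborWritable,
// mimicking Hadoop's value iterator, so the object re-use bug would surface in the test.
values = new Iterator<NeighborWritable>() {
    private final NeighborWritable reused = new NeighborWritable();
    private int i = 0;

    public boolean hasNext() {
        return i < 4;
    }

    public NeighborWritable next() {
        // Overwrite the same instance on every call, just like the framework does via readFields
        reused.setNodeId(new Text("300" + i));
        reused.setWeight(new DoubleWritable((double) 1 / (1 + i)));
        i++;
        return reused;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }
};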