I have an object constructed as follows:
class ObjExample {
String s;
Object[] objArray; // an element of this array can be a primitive type or an array of a primitive type
}
I know that to use it as an output type for a mapper or reducer, we have to implement WritableComparable for it.
But I am really confused about how to write readFields(), write(), and compareTo() for this kind of class.
You can wrap the field s in a Text and objArray in an ArrayWritable. Each element of objArray would itself be an ArrayWritable of primitives. Here is a possible implementation:
public static final class ObjExample implements WritableComparable<ObjExample> {
public final Text s = new Text(); // wrapped String
public final ArrayOfArrays objArray = new ArrayOfArrays();
@Override
public int compareTo(ObjExample o) {
// your logic here, example:
return s.compareTo(o.s);
}
@Override
public void write(DataOutput dataOutput) throws IOException {
s.write(dataOutput);
objArray.write(dataOutput);
}
@Override
public void readFields(DataInput dataInput) throws IOException {
s.readFields(dataInput);
objArray.readFields(dataInput);
}
// set size of the objArray
public void setSize(int n) {
objArray.set(new IntArray[n]);
}
// set i-th element of the objArray to an array of elements
public void setElement(int i, IntWritable... elements) {
IntArray subArr = new IntArray();
subArr.set(elements);
objArray.get()[i] = subArr;
}
}
You will need two more classes to make it work:
// array of primitives
public static final class IntArray extends ArrayWritable {
public IntArray() {
// you can specify any other primitive wrapper (DoubleWritable, Text, ...)
super(IntWritable.class);
}
}
// array of arrays
public static final class ArrayOfArrays extends ArrayWritable {
public ArrayOfArrays() {
super(IntArray.class);
}
}
Example of construction of the object:
ObjExample o = new ObjExample();
o.s.set("hello");
o.setSize(2);
o.setElement(0, new IntWritable(0)); // single primitive
o.setElement(1, new IntWritable(1), new IntWritable(2)); // array of primitives
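If you later need to read the values back out (for example in a reducer), the nested wrappers can be unwrapped by casting. A minimal sketch, assuming the IntWritable element type used above:
// Minimal sketch: unwrapping the nested arrays of an ObjExample (assumes IntWritable elements)
public static void printContents(ObjExample obj) {
    System.out.println(obj.s.toString());
    for (Writable sub : obj.objArray.get()) {        // each element is an IntArray
        for (Writable w : ((IntArray) sub).get()) {  // each inner element wraps one primitive
            System.out.print(((IntWritable) w).get() + " ");
        }
        System.out.println();
    }
}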
I am learning Flink and I started with a simple word count using DataStream. To enhance the processing I filtered the output to show only the words found 3 or more times.
DataStream<Tuple2<String, Integer>> dataStream = env
.socketTextStream("localhost", 9000)
.flatMap(new Splitter())
.keyBy(0)
.timeWindow(Time.seconds(5))
.apply(new MyWindowFunction())
.sum(1)
.filter(word -> word.f1 >= 3);
I would like to create a WindowFunction to sort the output by the value of words found. The WindowFunction that I am trying to implement does not compile at all. I am struggling to define the apply method and the parameters of the WindowFunction interface.
public static class MyWindowFunction implements WindowFunction<
Tuple2<String, Integer>, // input type
Tuple2<String, Integer>, // output type
Tuple2<String, Integer>, // key type
TimeWindow> {
void apply(Tuple2<String, Integer> key, TimeWindow window, Iterable<Tuple2<String, Integer>> input, Collector<Tuple2<String, Integer>> out) {
String word = ((Tuple2<String, Integer>)key).f0;
Integer count = ((Tuple2<String, Integer>)key).f1;
.........
out.collect(new Tuple2<>(word, count));
}
}
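For reference, the generics have to match what keyBy produces: with .keyBy(0) the key handed to the window function is a generic Tuple, not a Tuple2<String, Integer>, and apply has to be public. A sketch of a version that should compile against the older Flink API used above (assuming the same Tuple2<String, Integer> stream and .keyBy(0)):
// Sketch: WindowFunction whose type parameters line up with .keyBy(0) on a Tuple2<String, Integer> stream
public static class MyWindowFunction implements WindowFunction<
        Tuple2<String, Integer>,  // input type
        Tuple2<String, Integer>,  // output type
        Tuple,                    // key type produced by keyBy(0)
        TimeWindow> {
    @Override
    public void apply(Tuple key, TimeWindow window,
                      Iterable<Tuple2<String, Integer>> input,
                      Collector<Tuple2<String, Integer>> out) {
        for (Tuple2<String, Integer> element : input) {
            out.collect(element); // forward the windowed elements; sorting/aggregation would go here
        }
    }
}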
I am updating this answer to use Flink 1.12.0. In order to sort the elements of the stream I had to use a KeyedProcessFunction after counting the stream with a ReduceFunction. Then I had to set the parallelism of the very last transformation to 1 in order not to change the order of the elements that I sorted using the KeyedProcessFunction. The sequence that I am using is socketTextStream -> flatMap -> keyBy -> reduce -> keyBy -> process -> print().setParallelism(1). Below is the example:
public class SocketWindowWordCountJava {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.socketTextStream("localhost", 9000)
.flatMap(new SplitterFlatMap())
.keyBy(new WordKeySelector())
.reduce(new SumReducer())
.keyBy(new WordKeySelector())
.process(new SortKeyedProcessFunction(3 * 1000))
.print().setParallelism(1);
String executionPlan = env.getExecutionPlan();
System.out.println("ExecutionPlan ........................ ");
System.out.println(executionPlan);
System.out.println("........................ ");
env.execute("Window WordCount sorted");
}
}
The UDF that I used to sort the stream is SortKeyedProcessFunction, which extends KeyedProcessFunction. I use a ValueState<List<Event>> listState, where Event implements Comparable<Event>, to keep the list of events as state. In the processElement method I register a processing-time timer for the time at which I added the event to the state (context.timerService().registerProcessingTimeTimer(timeoutTime);) and I collect the events in the onTimer method. I am also using a time window of 3 seconds here.
public class SortKeyedProcessFunction extends KeyedProcessFunction<String, Tuple2<String, Integer>, Event> {
private static final long serialVersionUID = 7289761960983988878L;
// delay (ms) after which the buffered events are sorted and emitted
private final long timeOut;
// state holding the events collected so far and the last timer registered
private ValueState<List<Event>> listState = null;
private ValueState<Long> lastTime = null;
public SortKeyedProcessFunction(long timeOut) {
this.timeOut = timeOut;
}
@Override
public void open(Configuration conf) {
// set up the list state and the timer state
ValueStateDescriptor<List<Event>> descriptor = new ValueStateDescriptor<>(
// state name
"sorted-events",
// type information of state
TypeInformation.of(new TypeHint<List<Event>>() {
}));
listState = getRuntimeContext().getState(descriptor);
ValueStateDescriptor<Long> descriptorLastTime = new ValueStateDescriptor<Long>(
"lastTime",
TypeInformation.of(new TypeHint<Long>() {
}));
lastTime = getRuntimeContext().getState(descriptorLastTime);
}
@Override
public void processElement(Tuple2<String, Integer> value, Context context, Collector<Event> collector) throws Exception {
// get current time and compute timeout time
long currentTime = context.timerService().currentProcessingTime();
long timeoutTime = currentTime + timeOut;
// register timer for timeout time
context.timerService().registerProcessingTimeTimer(timeoutTime);
List<Event> queue = listState.value();
if (queue == null) {
queue = new ArrayList<Event>();
}
Long current = lastTime.value();
queue.add(new Event(value.f0, value.f1));
lastTime.update(timeoutTime);
listState.update(queue);
}
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out) throws Exception {
// System.out.println("onTimer: " + timestamp);
// check if this was the last timer we registered
System.out.println("timestamp: " + timestamp);
List<Event> queue = listState.value();
Long current = lastTime.value();
if (timestamp == current.longValue()) {
Collections.sort(queue);
queue.forEach( e -> {
out.collect(e);
});
queue.clear();
listState.clear();
}
}
}
class Event implements Comparable<Event> {
String value;
Integer qtd;
public Event(String value, Integer qtd) {
this.value = value;
this.qtd = qtd;
}
public String getValue() { return value; }
public Integer getQtd() { return qtd; }
@Override
public String toString() {
return "Event{" +"value='" + value + '\'' +", qtd=" + qtd +'}';
}
@Override
public int compareTo(@NotNull Event event) {
return this.getValue().compareTo(event.getValue());
}
}
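If the goal is to sort by the number of occurrences rather than by the word itself, the comparison could be based on qtd instead; a sketch (descending by count, as an assumption about the intended order):
// Sketch: order events by count (descending) instead of by word
@Override
public int compareTo(Event event) {
    return event.getQtd().compareTo(this.getQtd());
}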
So when I use $ nc -lk 9000 and type words on the console, I see them in order in the output
...
Event{value='soccer', qtd=7}
Event{value='swim', qtd=5}
...
Event{value='basketball', qtd=9}
Event{value='soccer', qtd=8}
Event{value='swim', qtd=6}
The other UDFs are for the other transformations of the stream program and they are here for completeness.
public class SplitterFlatMap implements FlatMapFunction<String, Tuple2<String, Integer>> {
private static final long serialVersionUID = 3121588720675797629L;
@Override
public void flatMap(String sentence, Collector<Tuple2<String, Integer>> out) throws Exception {
for (String word : sentence.split(" ")) {
out.collect(Tuple2.of(word, 1));
}
}
}
public class WordKeySelector implements KeySelector<Tuple2<String, Integer>, String> {
@Override
public String getKey(Tuple2<String, Integer> value) throws Exception {
return value.f0;
}
}
public class SumReducer implements ReduceFunction<Tuple2<String, Integer>> {
@Override
public Tuple2<String, Integer> reduce(Tuple2<String, Integer> event1, Tuple2<String, Integer> event2) throws Exception {
return Tuple2.of(event1.f0, event1.f1 + event2.f1);
}
}
The .sum(1) method will do everything you need (no need for using apply()), as long as the Splitter class (which should be a FlatMapFunction) is emitting Tuple2<String, Integer> records, where String is the word, and Integer is always 1.
So then .sum(1) will do the aggregation for you. If you needed something different than what sum() does, you would typically use .reduce(new MyCustomReduceFunction()), as that is going to be the most efficient and scalable approach, in terms of not needing to buffer large amounts of data in memory.
I'm trying to implement a heap sorting algorithm.
My problem is that when I try to insert Elements into my PriorityQueue, it only works for one element. When I add multiple elements, I get these errors:
Exception in thread "main" java.lang.ClassCastException: Element cannot be cast to java.lang.Comparable
at java.util.PriorityQueue.siftUpComparable(PriorityQueue.java:652)
at java.util.PriorityQueue.siftUp(PriorityQueue.java:647)
at java.util.PriorityQueue.offer(PriorityQueue.java:344)
at java.util.PriorityQueue.add(PriorityQueue.java:321)
at PQHeap.insert(PQHeap.java:47)
at PQHeap.main(PQHeap.java:17)
This is my Element class
public class Element {
public int key;
public Object data;
public Element(int i, Object o) {
this.key = i;
this.data = o;
}}
The interface:
public interface PQ {
public Element extractMin();
public void insert(Element e);
}
And this is the class that generates the heap. Note that the main method is located here just for debugging. When I only insert Element e, it works. But when I insert f as well, it gives me the errors above.
import java.util.*;
public class PQHeap implements PQ{
public static void main(String[] args) {
PQHeap hq = new PQHeap(5);
Element e = new Element(5, null);
hq.insert(e);
Element f = new Element(3, null); // adding a second element triggers the exception
hq.insert(f);
for(int in = 0; in<hq.pq.size();in++){
System.out.println(hq.pq.remove());
}
}// end of main method
public PriorityQueue<Element> pq;
public PQHeap(int maxElms) {
this.pq = new PriorityQueue<Element>(maxElms);
}
@Override
public Element extractMin() {
Element min = pq.remove();
System.out.println(min.key);
return min;
}
@Override
public void insert(Element e) {
this.pq.add(e);
}
}
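The stack trace above is thrown because java.util.PriorityQueue requires its elements to implement Comparable (or a Comparator to be supplied at construction); the exception only appears once a second element forces a comparison. A minimal sketch, assuming the intent is to order elements by key:
// Sketch: make Element comparable by its key so PriorityQueue can order it
public class Element implements Comparable<Element> {
    public int key;
    public Object data;

    public Element(int i, Object o) {
        this.key = i;
        this.data = o;
    }

    @Override
    public int compareTo(Element other) {
        return Integer.compare(this.key, other.key); // smallest key first
    }
}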
I have this Class:
public class Stripe implements WritableComparable<Stripe>{
private List<Term> occorrenze = new ArrayList<Term>();
public Stripe(){}
@Override
public void readFields(DataInput in) throws IOException {
}
}
public class Term implements WritableComparable<Term> {
private Text key;
private IntWritable frequency;
@Override
public void readFields(DataInput in) throws IOException {
this.key.readFields(in);
this.frequency.readFields(in);
}
Stripe is a list of Term (a pair of Text and IntWritable).
How can I write the readFields method to read the complex type Stripe from a DataInput?
To serialize a list you'll need to write out the length of the list, followed by the elements themselves. A simple readFields / write method pair for Stripe could be:
@Override
public void readFields(DataInput in) throws IOException {
occorrenze.clear();
int cnt = in.readInt();
for (int x = 0; x < cnt; x++) {
Term term = new Term(); // Term's no-arg constructor must initialize key and frequency
term.readFields(in);
occorrenze.add(term);
}
}
@Override
public void write(DataOutput out) throws IOException {
out.writeInt(occorrenze.size());
for (Term term : occorrenze) {
term.write(out);
}
}
You could make this more efficient by using a VInt rather than an int, and by using a pool of Terms which can be re-used, to save on object creation / garbage collection in the readFields method.
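For the VInt variant, Hadoop's WritableUtils provides variable-length integer encodings; a sketch of the same method pair using it (assuming the rest of the class stays unchanged):
// Sketch: same serialization, but with a variable-length count via WritableUtils
@Override
public void readFields(DataInput in) throws IOException {
    occorrenze.clear();
    int cnt = WritableUtils.readVInt(in);   // variable-length encoded count
    for (int x = 0; x < cnt; x++) {
        Term term = new Term();
        term.readFields(in);
        occorrenze.add(term);
    }
}

@Override
public void write(DataOutput out) throws IOException {
    WritableUtils.writeVInt(out, occorrenze.size());
    for (Term term : occorrenze) {
        term.write(out);
    }
}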
You could use ArrayWritable which is a list of writables of the same type.
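If you go the ArrayWritable route, the usual pattern is a small subclass that fixes the element type; the class name here is just illustrative:
// Sketch: an ArrayWritable whose elements are Term instances
public class TermArrayWritable extends ArrayWritable {
    public TermArrayWritable() {
        super(Term.class);
    }
    public TermArrayWritable(Term[] values) {
        super(Term.class, values);
    }
}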
I am new to Hadoop and Java, and I feel there is something obvious I am just missing. I am using Hadoop 1.0.3 if that means anything.
My goal for using Hadoop is to take a bunch of files and parse them one file at a time (as opposed to line by line). Each file will produce multiple key-values, but context from the other lines is important. The key and value are multi-value/composite, so I have implemented WritableComparable for the key and Writable for the value. Because the processing of each file takes a bit of CPU, I want to save the output of the mapper, then run multiple reducers later on.
For the composite keys, I followed http://stackoverflow.com/questions/12427090/hadoop-composite-key
The problem is, the output is just Java object references as opposed to the composite key and value. Example:
LinkKeyWritable#bd2f9730 LinkValueWritable#8752408c
I am not sure if the problem is related to not reducing the data at all, or something else.
Here is my main class:
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(Parser.class);
conf.setJobName("raw_parser");
conf.setOutputKeyClass(LinkKeyWritable.class);
conf.setOutputValueClass(LinkValueWritable.class);
conf.setMapperClass(RawMap.class);
conf.setNumMapTasks(0);
conf.setInputFormat(PerFileInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
PerFileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
And my Mapper class:
public class RawMap extends MapReduceBase implements
Mapper<NullWritable, Text, LinkKeyWritable, LinkValueWritable> {
public void map(NullWritable key, Text value,
OutputCollector<LinkKeyWritable, LinkValueWritable> output,
Reporter reporter) throws IOException {
String json = value.toString();
SerpyReader reader = new SerpyReader(json);
GoogleParser parser = new GoogleParser(reader);
for (String page : reader.getPages()) {
String content = reader.readPageContent(page);
parser.addPage(content);
}
for (Link link : parser.getLinks()) {
LinkKeyWritable linkKey = new LinkKeyWritable(link);
LinkValueWritable linkValue = new LinkValueWritable(link);
output.collect(linkKey, linkValue);
}
}
}
Link is basically a struct of various pieces of information that gets split between LinkKeyWritable and LinkValueWritable.
LinkKeyWritable:
public class LinkKeyWritable implements WritableComparable<LinkKeyWritable>{
protected Link link;
public LinkKeyWritable() {
super();
link = new Link();
}
public LinkKeyWritable(Link link) {
super();
this.link = link;
}
@Override
public void readFields(DataInput in) throws IOException {
link.batchDay = in.readLong();
link.source = in.readUTF();
link.domain = in.readUTF();
link.path = in.readUTF();
}
@Override
public void write(DataOutput out) throws IOException {
out.writeLong(link.batchDay);
out.writeUTF(link.source);
out.writeUTF(link.domain);
out.writeUTF(link.path);
}
@Override
public int compareTo(LinkKeyWritable o) {
return ComparisonChain.start().
compare(link.batchDay, o.link.batchDay).
compare(link.domain, o.link.domain).
compare(link.path, o.link.path).
result();
}
@Override
public int hashCode() {
return Objects.hashCode(link.batchDay, link.source, link.domain, link.path);
}
@Override
public boolean equals(final Object obj){
if(obj instanceof LinkKeyWritable) {
final LinkKeyWritable o = (LinkKeyWritable)obj;
return Objects.equal(link.batchDay, o.link.batchDay)
&& Objects.equal(link.source, o.link.source)
&& Objects.equal(link.domain, o.link.domain)
&& Objects.equal(link.path, o.link.path);
}
return false;
}
}
LinkValueWritable:
public class LinkValueWritable implements Writable{
protected Link link;
public LinkValueWritable() {
link = new Link();
}
public LinkValueWritable(Link link) {
this.link = new Link();
this.link.type = link.type;
this.link.description = link.description;
}
@Override
public void readFields(DataInput in) throws IOException {
link.type = in.readUTF();
link.description = in.readUTF();
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(link.type);
out.writeUTF(link.description);
}
@Override
public int hashCode() {
return Objects.hashCode(link.type, link.description);
}
@Override
public boolean equals(final Object obj){
if(obj instanceof LinkValueWritable) {
final LinkValueWritable o = (LinkValueWritable)obj;
return Objects.equal(link.type, o.link.type)
&& Objects.equal(link.description, o.link.description);
}
return false;
}
}
I think the answer is in the implementation of the TextOutputFormat. Specifically, the LineRecordWriter's writeObject method:
/**
* Write the object to the byte stream, handling Text as a special
* case.
* @param o the object to print
* @throws IOException if the write throws, we pass it on
*/
private void writeObject(Object o) throws IOException {
if (o instanceof Text) {
Text to = (Text) o;
out.write(to.getBytes(), 0, to.getLength());
} else {
out.write(o.toString().getBytes(utf8));
}
}
As you can see, if your key or value is not a Text object, it calls the toString method on it and writes that out. Since you've left toString unimplemented in your key and value, it's using the Object class's implementation, which is writing out the reference.
I'd say that you should try writing an appropriate toString function or using a different OutputFormat.
It looks like you have a list of objects just like you wanted. You need to implement toString() on your writable if you want a human-readable version printed out instead of an ugly java reference.
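For example, a possible toString for LinkKeyWritable could be (a sketch; the exact textual layout is up to you):
// Sketch: human-readable form of the key, used by TextOutputFormat when writing records
@Override
public String toString() {
    return link.batchDay + "\t" + link.source + "\t" + link.domain + "\t" + link.path;
}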
I have this Hadoop MapReduce code that works on graph data (in adjacency-list form) and is similar in spirit to in-adjacency-list to out-adjacency-list transformation algorithms. The main MapReduce task code is the following:
public class TestTask extends Configured
implements Tool {
public static class TTMapper extends MapReduceBase
implements Mapper<Text, TextArrayWritable, Text, NeighborWritable> {
@Override
public void map(Text key,
TextArrayWritable value,
OutputCollector<Text, NeighborWritable> output,
Reporter reporter) throws IOException {
int numNeighbors = value.get().length;
double weight = (double)1 / numNeighbors;
Text[] neighbors = (Text[]) value.toArray();
NeighborWritable me = new NeighborWritable(key, new DoubleWritable(weight));
for (int i = 0; i < neighbors.length; i++) {
output.collect(neighbors[i], me);
}
}
}
public static class TTReducer extends MapReduceBase
implements Reducer<Text, NeighborWritable, Text, Text> {
@Override
public void reduce(Text key,
Iterator<NeighborWritable> values,
OutputCollector<Text, Text> output,
Reporter arg3)
throws IOException {
ArrayList<NeighborWritable> neighborList = new ArrayList<NeighborWritable>();
while(values.hasNext()) {
neighborList.add(values.next());
}
NeighborArrayWritable neighbors = new NeighborArrayWritable
(neighborList.toArray(new NeighborWritable[0]));
Text out = new Text(neighbors.toString());
output.collect(key, out);
}
}
@Override
public int run(String[] arg0) throws Exception {
JobConf conf = Util.getMapRedJobConf("testJob",
SequenceFileInputFormat.class,
TTMapper.class,
Text.class,
NeighborWritable.class,
1,
TTReducer.class,
Text.class,
Text.class,
TextOutputFormat.class,
"test/in",
"test/out");
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new TestTask(), args);
System.exit(res);
}
}
The auxiliary code is the following:
TextArrayWritable:
public class TextArrayWritable extends ArrayWritable {
public TextArrayWritable() {
super(Text.class);
}
public TextArrayWritable(Text[] values) {
super(Text.class, values);
}
}
NeighborWritable:
public class NeighborWritable implements Writable {
private Text nodeId;
private DoubleWritable weight;
public NeighborWritable(Text nodeId, DoubleWritable weight) {
this.nodeId = nodeId;
this.weight = weight;
}
public NeighborWritable () { }
public Text getNodeId() {
return nodeId;
}
public DoubleWritable getWeight() {
return weight;
}
public void setNodeId(Text nodeId) {
this.nodeId = nodeId;
}
public void setWeight(DoubleWritable weight) {
this.weight = weight;
}
@Override
public void readFields(DataInput in) throws IOException {
nodeId = new Text();
nodeId.readFields(in);
weight = new DoubleWritable();
weight.readFields(in);
}
@Override
public void write(DataOutput out) throws IOException {
nodeId.write(out);
weight.write(out);
}
public String toString() {
return "NW[nodeId=" + (nodeId != null ? nodeId.toString() : "(null)") +
",weight=" + (weight != null ? weight.toString() : "(null)") + "]";
}
public boolean equals(Object o) {
if (!(o instanceof NeighborWritable)) {
return false;
}
NeighborWritable that = (NeighborWritable)o;
return (nodeId.equals(that.getNodeId()) && (weight.equals(that.getWeight())));
}
}
and the Util class:
public class Util {
public static JobConf getMapRedJobConf(String jobName,
Class<? extends InputFormat> inputFormatClass,
Class<? extends Mapper> mapperClass,
Class<?> mapOutputKeyClass,
Class<?> mapOutputValueClass,
int numReducer,
Class<? extends Reducer> reducerClass,
Class<?> outputKeyClass,
Class<?> outputValueClass,
Class<? extends OutputFormat> outputFormatClass,
String inputDir,
String outputDir) throws IOException {
JobConf conf = new JobConf();
if (jobName != null)
conf.setJobName(jobName);
conf.setInputFormat(inputFormatClass);
conf.setMapperClass(mapperClass);
if (numReducer == 0) {
conf.setNumReduceTasks(0);
conf.setOutputKeyClass(outputKeyClass);
conf.setOutputValueClass(outputValueClass);
conf.setOutputFormat(outputFormatClass);
} else {
// may set actual number of reducers
// conf.setNumReduceTasks(numReducer);
conf.setMapOutputKeyClass(mapOutputKeyClass);
conf.setMapOutputValueClass(mapOutputValueClass);
conf.setReducerClass(reducerClass);
conf.setOutputKeyClass(outputKeyClass);
conf.setOutputValueClass(outputValueClass);
conf.setOutputFormat(outputFormatClass);
}
// delete the existing target output folder
FileSystem fs = FileSystem.get(conf);
fs.delete(new Path(outputDir), true);
// specify input and output DIRECTORIES (not files)
FileInputFormat.addInputPath(conf, new Path(inputDir));
FileOutputFormat.setOutputPath(conf, new Path(outputDir));
return conf;
}
}
My input is the following graph (stored in binary format; here I am giving the text form):
1 2
2 1,3,5
3 2,4
4 3,5
5 2,4
According to the logic of the code the output should be:
1 NWArray[size=1,{NW[nodeId=2,weight=0.3333333333333333],}]
2 NWArray[size=3,{NW[nodeId=5,weight=0.5],NW[nodeId=3,weight=0.5],NW[nodeId=1,weight=1.0],}]
3 NWArray[size=2,{NW[nodeId=2,weight=0.3333333333333333],NW[nodeId=4,weight=0.5],}]
4 NWArray[size=2,{NW[nodeId=5,weight=0.5],NW[nodeId=3,weight=0.5],}]
5 NWArray[size=2,{NW[nodeId=2,weight=0.3333333333333333],NW[nodeId=4,weight=0.5],}]
But the output is coming as:
1 NWArray[size=1,{NW[nodeId=2,weight=0.3333333333333333],}]
2 NWArray[size=3,{NW[nodeId=5,weight=0.5],NW[nodeId=5,weight=0.5],NW[nodeId=5,weight=0.5],}]
3 NWArray[size=2,{NW[nodeId=2,weight=0.3333333333333333],NW[nodeId=2,weight=0.3333333333333333],}]
4 NWArray[size=2,{NW[nodeId=5,weight=0.5],NW[nodeId=5,weight=0.5],}]
5 NWArray[size=2,{NW[nodeId=2,weight=0.3333333333333333],NW[nodeId=2,weight=0.3333333333333333],}]
I cannot understand why the expected output is not being produced. Any help will be appreciated.
Thanks.
You're falling foul of object re-use:
while(values.hasNext()) {
neighborList.add(values.next());
}
values.next() will return the same object reference, but the underlying contents of that object change on each iteration (the readFields method is called to re-populate the contents).
I suggest you amend it to the following (you'll need to obtain the Configuration conf variable from a setup method, unless you can obtain it from the Reporter or OutputCollector; sorry, I don't use the old API):
while(values.hasNext()) {
neighborList.add(
ReflectionUtils.copy(conf, values.next(), new NeighborWritable()));
}
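With the old (mapred) API, one way to get hold of the configuration is to override configure on the reducer, roughly like this sketch:
// Sketch: capture the JobConf in the old mapred API so it can be passed to ReflectionUtils.copy
private JobConf conf;

@Override
public void configure(JobConf job) {
    super.configure(job);
    this.conf = job;
}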
But I still can't understand why my unit test passed then. Here is the code:
public class UWLTInitReducerTest {
private Text key;
private Iterator<NeighborWritable> values;
private NeighborArrayWritable nodeData;
private TTReducer reducer;
/**
* Set up the states for calling the map function
*/
@Before
public void setUp() throws Exception {
key = new Text("1001");
NeighborWritable[] neighbors = new NeighborWritable[4];
for (int i = 0; i < 4; i++) {
neighbors[i] = new NeighborWritable(new Text("300" + i), new DoubleWritable((double) 1 / (1 + i)));
}
values = Arrays.asList(neighbors).iterator();
nodeData = new NeighborArrayWritable(neighbors);
reducer = new TTReducer();
}
/**
* Test method for InitModelMapper#map - valid input
*/
@Test
public void testMapValid() {
// mock the output object
OutputCollector<Text, UWLTNodeData> output = mock(OutputCollector.class);
try {
// call the API
reducer.reduce(key, values, output, null);
// in order (sequential) verification of the calls to output.collect()
verify(output).collect(key, nodeData);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
Why didn't this code catch the bug?
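The test passes because Arrays.asList(neighbors).iterator() hands out a distinct, fully populated object on every call, while Hadoop's value iterator re-populates one shared instance via readFields. A sketch of a hypothetical test iterator that mimics that reuse, which would make the test fail the same way the job does:
// Sketch: an iterator that reuses a single NeighborWritable, like Hadoop's reducer value iterator
public static Iterator<NeighborWritable> reusingIterator(final NeighborWritable[] source) {
    return new Iterator<NeighborWritable>() {
        private final NeighborWritable reused = new NeighborWritable();
        private int i = 0;

        public boolean hasNext() {
            return i < source.length;
        }

        public NeighborWritable next() {
            // overwrite the same instance, as the framework does when deserializing each value
            reused.setNodeId(source[i].getNodeId());
            reused.setWeight(source[i].getWeight());
            i++;
            return reused;
        }

        public void remove() {
            throw new UnsupportedOperationException();
        }
    };
}
Feeding reduce with such an iterator fills neighborList with the same reference four times, which reproduces the duplicated entries seen in the real output.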