Sorting DataStream using Apache Flink - sorting

I am learning Flink and I started with a simple word count using DataStream. To enhance the processing I filtered the output to show only the results with 3 or more words found.
DataStream<Tuple2<String, Integer>> dataStream = env
.socketTextStream("localhost", 9000)
.flatMap(new Splitter())
.keyBy(0)
.timeWindow(Time.seconds(5))
.apply(new MyWindowFunction())
.sum(1)
.filter(word -> word.f1 >= 3);
I would like to create a WindowFunction to sort the output by the value of words found. The WindowFunction that I am trying to implement does not compile at all. I am struggling to define the apply method and the parameters of the WindowFunction interface.
public static class MyWindowFunction implements WindowFunction<
Tuple2<String, Integer>, // input type
Tuple2<String, Integer>, // output type
Tuple2<String, Integer>, // key type
TimeWindow> {
void apply(Tuple2<String, Integer> key, TimeWindow window, Iterable<Tuple2<String, Integer>> input, Collector<Tuple2<String, Integer>> out) {
String word = ((Tuple2<String, Integer>)key).f0;
Integer count = ((Tuple2<String, Integer>)key).f1;
.........
out.collect(new Tuple2<>(word, count));
}
}

I am updating this answer to use Flink 1.12.0. In order to sort the elements of a stream in I had to use a KeyedProcessFunction after counting the stream with a ReduceFunction. Then I had to set the parallelism of the very last transformation to 1 in order to not change the order of the elements that I sorted using KeyedProcessFunction. The sequence that I am using is socketTextStream -> flatMap -> keyBy -> reduce -> keyBy -> process -> print().setParallelism(1). Bellow it the example:
public class SocketWindowWordCountJava {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.socketTextStream("localhost", 9000)
.flatMap(new SplitterFlatMap())
.keyBy(new WordKeySelector())
.reduce(new SumReducer())
.keyBy(new WordKeySelector())
.process(new SortKeyedProcessFunction(3 * 1000))
.print().setParallelism(1);
String executionPlan = env.getExecutionPlan();
System.out.println("ExecutionPlan ........................ ");
System.out.println(executionPlan);
System.out.println("........................ ");
env.execute("Window WordCount sorted");
}
}
The UDF that I used to sort the stream is the SortKeyedProcessFunction which extends KeyedProcessFunction. I use a ValueState<List<Event>> listState of Event implements Comparable<Event> to have a sorted list as state. On the processElement method I register the time stamp that I added the event to the state context.timerService().registerProcessingTimeTimer(timeoutTime); and I collect the event at the onTimer method. I am also using a time window of 3 seconds here.
public class SortKeyedProcessFunction extends KeyedProcessFunction<String, Tuple2<String, Integer>, Event> {
private static final long serialVersionUID = 7289761960983988878L;
// delay after which an alert flag is thrown
private final long timeOut;
// state to remember the last timer set
private ValueState<List<Event>> listState = null;
private ValueState<Long> lastTime = null;
public SortKeyedProcessFunction(long timeOut) {
this.timeOut = timeOut;
}
#Override
public void open(Configuration conf) {
// setup timer and HLL state
ValueStateDescriptor<List<Event>> descriptor = new ValueStateDescriptor<>(
// state name
"sorted-events",
// type information of state
TypeInformation.of(new TypeHint<List<Event>>() {
}));
listState = getRuntimeContext().getState(descriptor);
ValueStateDescriptor<Long> descriptorLastTime = new ValueStateDescriptor<Long>(
"lastTime",
TypeInformation.of(new TypeHint<Long>() {
}));
lastTime = getRuntimeContext().getState(descriptorLastTime);
}
#Override
public void processElement(Tuple2<String, Integer> value, Context context, Collector<Event> collector) throws Exception {
// get current time and compute timeout time
long currentTime = context.timerService().currentProcessingTime();
long timeoutTime = currentTime + timeOut;
// register timer for timeout time
context.timerService().registerProcessingTimeTimer(timeoutTime);
List<Event> queue = listState.value();
if (queue == null) {
queue = new ArrayList<Event>();
}
Long current = lastTime.value();
queue.add(new Event(value.f0, value.f1));
lastTime.update(timeoutTime);
listState.update(queue);
}
#Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out) throws Exception {
// System.out.println("onTimer: " + timestamp);
// check if this was the last timer we registered
System.out.println("timestamp: " + timestamp);
List<Event> queue = listState.value();
Long current = lastTime.value();
if (timestamp == current.longValue()) {
Collections.sort(queue);
queue.forEach( e -> {
out.collect(e);
});
queue.clear();
listState.clear();
}
}
}
class Event implements Comparable<Event> {
String value;
Integer qtd;
public Event(String value, Integer qtd) {
this.value = value;
this.qtd = qtd;
}
public String getValue() { return value; }
public Integer getQtd() { return qtd; }
#Override
public String toString() {
return "Event{" +"value='" + value + '\'' +", qtd=" + qtd +'}';
}
#Override
public int compareTo(#NotNull Event event) {
return this.getValue().compareTo(event.getValue());
}
}
So when I use $ nc -lk 9000 and type the words on the console I see them in order on the output
...
Event{value='soccer', qtd=7}
Event{value='swim', qtd=5}
...
Event{value='basketball', qtd=9}
Event{value='soccer', qtd=8}
Event{value='swim', qtd=6}
The other UDFs are for the other transformations of the stream program and they are here for completeness.
public class SplitterFlatMap implements FlatMapFunction<String, Tuple2<String, Integer>> {
private static final long serialVersionUID = 3121588720675797629L;
#Override
public void flatMap(String sentence, Collector<Tuple2<String, Integer>> out) throws Exception {
for (String word : sentence.split(" ")) {
out.collect(Tuple2.of(word, 1));
}
}
}
public class WordKeySelector implements KeySelector<Tuple2<String, Integer>, String> {
#Override
public String getKey(Tuple2<String, Integer> value) throws Exception {
return value.f0;
}
}
public class SumReducer implements ReduceFunction<Tuple2<String, Integer>> {
#Override
public Tuple2<String, Integer> reduce(Tuple2<String, Integer> event1, Tuple2<String, Integer> event2) throws Exception {
return Tuple2.of(event1.f0, event1.f1 + event2.f1);
}
}

The .sum(1) method will do everything you need (no need for using apply()), as long as the Splitter class (which should be a FlatMapFunction) is emitting Tuple2<String, Integer> records, where String is the word, and Integer is always 1.
So then .sum(1) will do the aggregation for you. If you needed something different than what sum() does, you would typically use .reduce(new MyCustomReduceFunction()), as that's going to be the most efficient and scalable approach, in terms of not needing to buffer lots in memory.

Related

Memory grows until OOM crash when using a WindowStore in Kafka streams

I built a stream that does a windowed join, when deploying on production all is fine in terms of memory and performance.
However, I needed Deduplication, hence implemented a Transformer that does that with the help of a WindowStore.
After deploying it, we are getting the data results that was expected, but memory keeps growing until the pod crashes with OOM.
After doing research I implemented many tricks to reduce memory usage, but they didn't help, below is the code.
It's clear to me that using the WindowStore is causing this issue, but how to limit it?
The Store:
var storeBuilder = Stores.windowStoreBuilder(
Stores.persistentWindowStore(
storeName,
Duration.ofSeconds(6),
Duration.ofSeconds(5),
false
),
Serdes.String(),
SerdeFactory.JsonSerde(valueDataClass)
);
The stream:
var leftStream = builder.stream("leftTopic").filter(...);
var rightStream = builder.stream("rightTopic").filter(...);
leftStream.join(rightStream, joiner, JoinWindows
.of(Duration.ofSeconds(5))
.grace(Duration.ofSeconds(1))
.until(
Duration.ofSeconds(6)
)
.transformValues(
() ->
new DeduplicationTransformer<>(
storeName,
Duration.ofSeconds(6),
(key, value) -> value.id
),
storeName
)
.filter((k, v) -> v != null)
.to("targetTopic");
Deduplication Transformer:
public class DeduplicationTransformer<K, V extends StreamModel>
implements ValueTransformerWithKey<K, V, V> {
private ProcessorContext context;
private String storeName;
private WindowStore<K, V> eventIdStore;
private final long leftDurationMs;
private final KeyValueMapper<K, V, K> idExtractor;
public DeduplicationTransformer(
String storeName,
long maintainDurationPerEventInMs,
final KeyValueMapper<K, V, K> idExtractor
) {
if (maintainDurationPerEventInMs < 2) {
throw new IllegalArgumentException(
"maintain duration per event must be > 1"
);
}
leftDurationMs = maintainDurationPerEventInMs;
this.idExtractor = idExtractor;
this.storeName = storeName;
}
#Override
public void init(final ProcessorContext context) {
this.context = context;
eventIdStore = (WindowStore<K, V>) context.getStateStore(storeName);
Duration interval = Duration.ofMillis(leftDurationMs);
this.context.schedule(
interval,
PunctuationType.WALL_CLOCK_TIME,
timestamp -> {
Instant from = Instant.ofEpochMilli(
System.currentTimeMillis() - leftDurationMs * 2
);
Instant to = Instant.ofEpochMilli(
System.currentTimeMillis() - leftDurationMs
);
KeyValueIterator<Windowed<K>, V> iterator = eventIdStore.fetchAll(
from,
to
);
while (iterator.hasNext()) {
KeyValue<Windowed<K>, V> entry = iterator.next();
eventIdStore.put(entry.key.key(), null, entry.key.window().start());
}
iterator.close();
context.commit();
}
);
}
#Override
public V transform(final K key, final V value) {
try {
final K eventId = idExtractor.apply(key, value);
if (eventId == null) {
return value;
} else {
final V output;
if (isDuplicate(eventId)) {
output = null;
} else {
output = value;
rememberNewEvent(eventId, value, context.timestamp());
}
return output;
}
} catch (Exception e) {
return null;
}
}
private boolean isDuplicate(final K eventId) {
final long eventTime = context.timestamp();
final WindowStoreIterator<V> timeIterator = eventIdStore.fetch(
eventId,
eventTime - leftDurationMs,
eventTime
);
final boolean isDuplicate = timeIterator.hasNext();
timeIterator.close();
return isDuplicate;
}
private void rememberNewEvent(final K eventId, V v, final long timestamp) {
eventIdStore.put(eventId, v, timestamp);
}
#Override
public void close() {}
}
RocksDB config:
public class BoundedMemoryRocksDBConfig implements RocksDBConfigSetter {
private Cache cache = new LRUCache(5 * 1024 * 1204L);
private Filter filter = new BloomFilter();
private WriteBufferManager writeBufferManager = new WriteBufferManager(
4 * 1024 * 1204L,
cache
);
#Override
public void setConfig(
final String storeName,
final Options options,
final Map<String, Object> configs
) {
BlockBasedTableConfig tableConfig = (BlockBasedTableConfig) options.tableFormatConfig();
tableConfig.setBlockCache(cache);
tableConfig.setCacheIndexAndFilterBlocks(true);
options.setWriteBufferManager(writeBufferManager);
tableConfig.setCacheIndexAndFilterBlocksWithHighPriority(false);
tableConfig.setPinTopLevelIndexAndFilter(true);
tableConfig.setBlockSize(4 * 1024L);
options.setMaxWriteBufferNumber(1);
options.setWriteBufferSize(1024 * 1024L);
options.setTableFormatConfig(tableConfig);
}
#Override
public void close(final String storeName, final Options options) {
cache.close();
filter.close();
}
}
Config:
props.put(
StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG,
0
);
props.put(
StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG,
BoundedMemoryRocksDBConfig.class
);
Things I've tried so far:
Using a bounded RocksDB config setter
Using jemalloc instead of malloc
Reducing the retention period to 5 seconds
Reducing the number of partitions of the topics (this only slowed the rate of the memory leak)
used in-memory stores instead of persistent, and memory was very stable, but the app startup takes around 10 minutes on each deployment.

Kafka Streams: batch keys for a time window and do some processing on the batch of keys together

I have a stream of incoming primary keys (PK) that I am reading in my Kafkastreams app, I would like to batch them over say last 1 minute and query my transactional DB to get more data for the batch of PKs (deduplicated) in the last minute. And for each PK I would like to post a message on output topic.
I was able to code this using Processor API like below:
Topology topology = new Topology();
topology.addSource("test-source", inputKeySerde.deserializer(), inputValueSerde.deserializer(), "input.kafka.topic")
.addProcessor("test-processor", processorSupplier, "test-source")
.addSink("test-sink", "output.kafka.topic", outputKeySerde.serializer(), outputValueSerde.serializer, "test-processor");
Here processor supplier has a process method that adds the PK to a queue and a punctuator that is scheduled to run every minute and drains the queue and queries transactional DB and forwards a message for every PK.
ProcessorSupplier<Integer, ValueType> processorSupplier = new ProcessorSupplier<Integer, ValueType>() {
public Processor<Integer, ValueType> get() {
return new Processor<Integer, ValueType>() {
private ProcessorContext context;
private BlockingQueue<Integer> ids;
public void init(ProcessorContext context) {
this.context = context;
this.context.schedule(Duration.ofMillis(1000), PunctuationType.WALL_CLOCK_TIME, this::punctuate);
ids = new LinkedBlockingQueue<>();
}
#Override
public void process(Integer key, ValueType value) {
ids.add(key);
}
public void punctuate(long timestamp) {
Set<Long> idSet = new HashSet<>();
ids.drainTo(idSet, 1000);
List<Document> documentList = createDocuments(ids);
documentList.stream().forEach(document -> context.forward(document.getId(), document));
context.commit();
}
#Override
public void close() {
}
};
}
};
Wondering if there is a simpler way to accomplish this using DSL windowedBy and reduce/aggregate route?
***** Updated code to use state store ******
ProcessorSupplier<Integer, ValueType> processorSupplier = new ProcessorSupplier<Integer, ValueType>() {
public Processor<Integer, ValueType> get() {
return new Processor<Integer, ValueType>() {
private ProcessorContext context;
private KeyValueStore<Integer, Integer> stateStore;
public void init(ProcessorContext context) {
this.context = context;
stateStore = (KeyValueStore) context.getStateStore("MyStore");
this.context.schedule(Duration.ofMillis(5000), PunctuationType.WALL_CLOCK_TIME, (timestamp) -> {
Set<Integer> ids = new HashSet<>();
try (KeyValueIterator<Integer, Integer> iter = this.stateStore.all()) {
while (iter.hasNext()) {
KeyValue<Integer, Integer> entry = iter.next();
ids.add(entry.key);
}
}
List<Document> documentList = createDocuments(dataRetriever, ids);
documentList.stream().forEach(document -> context.forward(document.getId(), document));
ids.stream().forEach(id -> stateStore.delete(id));
this.context.commit();
});
}
#Override
public void process(Integer key, ValueType value) {
Long id = key.getId();
stateStore.put(id, id);
}
#Override
public void close() {
}
};
}
};

Filter index hits by node ids in Neo4j

I have a set of node id's (Set< Long >) and want to restrict or filter the results of an query to only the nodes in this set. Is there a performant way to do this?
Set<Node> query(final GraphDatabaseService graphDb, final Set<Long> nodeSet) {
final Index<Node> searchIndex = graphdb.index().forNodes("search");
final IndexHits<Node> hits = searchIndex.query(new QueryContext("value*"));
// what now to return only index hits that are in the given Set of Node's?
}
Wouldn't be faster the other way round? If you get the nodes from your set and compare the property to the value you are looking for?
for (Iterator it=nodeSet.iterator();it.hasNext();) {
Node n=db.getNodeById(it.next());
if (!n.getProperty("value","").equals("foo")) it.remove();
}
or for your suggestion
Set<Node> query(final GraphDatabaseService graphDb, final Set<Long> nodeSet) {
final Index<Node> searchIndex = graphdb.index().forNodes("search");
final IndexHits<Node> hits = searchIndex.query(new QueryContext("value*"));
Set<Node> result=new HashSet<>();
for (Node n : hits) {
if (nodeSet.contains(n.getId())) result.add(n);
}
return result;
}
So the fastest solution I found was directly using lucenes IndexSearcher on the index created by neo4j and use an custom Filter to restrict the search to specific nodes.
Just open the neo4j index folder "{neo4j-database-folder}/index/lucene/node/{index-name}" with the lucene IndexReader. Make sure to use not add a lucene dependency to your project in another version than the one neo4j uses, which currently is lucene 3.6.2!
here's my lucene Filter implementation that filters all query results by the given Set of document id's. (Lucene Document id's (Integer) ARE NOT Neo4j Node id's (Long)!)
import java.io.IOException;
import java.util.PriorityQueue;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Filter;
public class DocIdFilter extends Filter {
public class FilteredDocIdSetIterator extends DocIdSetIterator {
private final PriorityQueue<Integer> filterQueue;
private int docId;
public FilteredDocIdSetIterator(final Set<Integer> filterSet) {
this(new PriorityQueue<Integer>(filterSet));
}
public FilteredDocIdSetIterator(final PriorityQueue<Integer> filterQueue) {
this.filterQueue = filterQueue;
}
#Override
public int docID() {
return this.docId;
}
#Override
public int nextDoc() throws IOException {
if (this.filterQueue.isEmpty()) {
this.docId = NO_MORE_DOCS;
} else {
this.docId = this.filterQueue.poll();
}
return this.docId;
}
#Override
public int advance(final int target) throws IOException {
while ((this.docId = this.nextDoc()) < target)
;
return this.docId;
}
}
private final PriorityQueue<Integer> filterQueue;
public DocIdFilter(final Set<Integer> filterSet) {
super();
this.filterQueue = new PriorityQueue<Integer>(filterSet);
}
private static final long serialVersionUID = -865683019349988312L;
#Override
public DocIdSet getDocIdSet(final IndexReader reader) throws IOException {
return new DocIdSet() {
#Override
public DocIdSetIterator iterator() throws IOException {
return new FilteredDocIdSetIterator(DocIdFilter.this.filterQueue);
}
};
}
}
To map the set of neo4j node id's (the query result should be filtered with) to the correct lucene document id's i created an inmemory bidirectional map:
public static HashBiMap<Integer, Long> generateDocIdToNodeIdMap(final IndexReader indexReader)
throws LuceneIndexException {
final HashBiMap<Integer, Long> result = HashBiMap.create(indexReader.numDocs());
for (int i = 0; i < indexReader.maxDoc(); i++) {
if (indexReader.isDeleted(i)) {
continue;
}
final Document doc;
try {
doc = indexReader.document(i, new FieldSelector() {
private static final long serialVersionUID = 5853247619312916012L;
#Override
public FieldSelectorResult accept(final String fieldName) {
if ("_id_".equals(fieldName)) {
return FieldSelectorResult.LOAD_AND_BREAK;
} else {
return FieldSelectorResult.NO_LOAD;
}
}
};
);
} catch (final IOException e) {
throw new LuceneIndexException(indexReader.directory(), "could not read document with ID: '" + i
+ "' from index.", e);
}
final Long nodeId;
try {
nodeId = Long.valueOf(doc.get("_id_"));
} catch (final NumberFormatException e) {
throw new LuceneIndexException(indexReader.directory(),
"could not parse node ID value from document ID: '" + i + "'", e);
}
result.put(i, nodeId);
}
return result;
}
I'm using the Google Guava Library that provides an bidirectional map and the initialization of collections with an specific size.

Why Hadoop shuffle not working as expected

I have this hadoop map reduce code that works on graph data (in adjacency list form) and kind of similar to in-adjacency list to out-adjacency list transformation algorithms. The main MapReduce Task code is following:
public class TestTask extends Configured
implements Tool {
public static class TTMapper extends MapReduceBase
implements Mapper<Text, TextArrayWritable, Text, NeighborWritable> {
#Override
public void map(Text key,
TextArrayWritable value,
OutputCollector<Text, NeighborWritable> output,
Reporter reporter) throws IOException {
int numNeighbors = value.get().length;
double weight = (double)1 / numNeighbors;
Text[] neighbors = (Text[]) value.toArray();
NeighborWritable me = new NeighborWritable(key, new DoubleWritable(weight));
for (int i = 0; i < neighbors.length; i++) {
output.collect(neighbors[i], me);
}
}
}
public static class TTReducer extends MapReduceBase
implements Reducer<Text, NeighborWritable, Text, Text> {
#Override
public void reduce(Text key,
Iterator<NeighborWritable> values,
OutputCollector<Text, Text> output,
Reporter arg3)
throws IOException {
ArrayList<NeighborWritable> neighborList = new ArrayList<NeighborWritable>();
while(values.hasNext()) {
neighborList.add(values.next());
}
NeighborArrayWritable neighbors = new NeighborArrayWritable
(neighborList.toArray(new NeighborWritable[0]));
Text out = new Text(neighbors.toString());
output.collect(key, out);
}
}
#Override
public int run(String[] arg0) throws Exception {
JobConf conf = Util.getMapRedJobConf("testJob",
SequenceFileInputFormat.class,
TTMapper.class,
Text.class,
NeighborWritable.class,
1,
TTReducer.class,
Text.class,
Text.class,
TextOutputFormat.class,
"test/in",
"test/out");
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new TestTask(), args);
System.exit(res);
}
}
The auxiliary code is following:
TextArrayWritable:
public class TextArrayWritable extends ArrayWritable {
public TextArrayWritable() {
super(Text.class);
}
public TextArrayWritable(Text[] values) {
super(Text.class, values);
}
}
NeighborWritable:
public class NeighborWritable implements Writable {
private Text nodeId;
private DoubleWritable weight;
public NeighborWritable(Text nodeId, DoubleWritable weight) {
this.nodeId = nodeId;
this.weight = weight;
}
public NeighborWritable () { }
public Text getNodeId() {
return nodeId;
}
public DoubleWritable getWeight() {
return weight;
}
public void setNodeId(Text nodeId) {
this.nodeId = nodeId;
}
public void setWeight(DoubleWritable weight) {
this.weight = weight;
}
#Override
public void readFields(DataInput in) throws IOException {
nodeId = new Text();
nodeId.readFields(in);
weight = new DoubleWritable();
weight.readFields(in);
}
#Override
public void write(DataOutput out) throws IOException {
nodeId.write(out);
weight.write(out);
}
public String toString() {
return "NW[nodeId=" + (nodeId != null ? nodeId.toString() : "(null)") +
",weight=" + (weight != null ? weight.toString() : "(null)") + "]";
}
public boolean equals(Object o) {
if (!(o instanceof NeighborWritable)) {
return false;
}
NeighborWritable that = (NeighborWritable)o;
return (nodeId.equals(that.getNodeId()) && (weight.equals(that.getWeight())));
}
}
and the Util class:
public class Util {
public static JobConf getMapRedJobConf(String jobName,
Class<? extends InputFormat> inputFormatClass,
Class<? extends Mapper> mapperClass,
Class<?> mapOutputKeyClass,
Class<?> mapOutputValueClass,
int numReducer,
Class<? extends Reducer> reducerClass,
Class<?> outputKeyClass,
Class<?> outputValueClass,
Class<? extends OutputFormat> outputFormatClass,
String inputDir,
String outputDir) throws IOException {
JobConf conf = new JobConf();
if (jobName != null)
conf.setJobName(jobName);
conf.setInputFormat(inputFormatClass);
conf.setMapperClass(mapperClass);
if (numReducer == 0) {
conf.setNumReduceTasks(0);
conf.setOutputKeyClass(outputKeyClass);
conf.setOutputValueClass(outputValueClass);
conf.setOutputFormat(outputFormatClass);
} else {
// may set actual number of reducers
// conf.setNumReduceTasks(numReducer);
conf.setMapOutputKeyClass(mapOutputKeyClass);
conf.setMapOutputValueClass(mapOutputValueClass);
conf.setReducerClass(reducerClass);
conf.setOutputKeyClass(outputKeyClass);
conf.setOutputValueClass(outputValueClass);
conf.setOutputFormat(outputFormatClass);
}
// delete the existing target output folder
FileSystem fs = FileSystem.get(conf);
fs.delete(new Path(outputDir), true);
// specify input and output DIRECTORIES (not files)
FileInputFormat.addInputPath(conf, new Path(inputDir));
FileOutputFormat.setOutputPath(conf, new Path(outputDir));
return conf;
}
}
My input is following graph: (in binary format, here I am giving the text format)
1 2
2 1,3,5
3 2,4
4 3,5
5 2,4
According to the logic of the code the output should be:
1 NWArray[size=1,{NW[nodeId=2,weight=0.3333333333333333],}]
2 NWArray[size=3,{NW[nodeId=5,weight=0.5],NW[nodeId=3,weight=0.5],NW[nodeId=1,weight=1.0],}]
3 NWArray[size=2,{NW[nodeId=2,weight=0.3333333333333333],NW[nodeId=4,weight=0.5],}]
4 NWArray[size=2,{NW[nodeId=5,weight=0.5],NW[nodeId=3,weight=0.5],}]
5 NWArray[size=2,{NW[nodeId=2,weight=0.3333333333333333],NW[nodeId=4,weight=0.5],}]
But the output is coming as:
1 NWArray[size=1,{NW[nodeId=2,weight=0.3333333333333333],}]
2 NWArray[size=3,{NW[nodeId=5,weight=0.5],NW[nodeId=5,weight=0.5],NW[nodeId=5,weight=0.5],}]
3 NWArray[size=2,{NW[nodeId=2,weight=0.3333333333333333],NW[nodeId=2,weight=0.3333333333333333],}]
4 NWArray[size=2,{NW[nodeId=5,weight=0.5],NW[nodeId=5,weight=0.5],}]
5 NWArray[size=2,{NW[nodeId=2,weight=0.3333333333333333],NW[nodeId=2,weight=0.3333333333333333],}]
I cannot understand the reason why the expected output is not coming out. Any help will be appreciated.
Thanks.
You're falling foul of object re-use
while(values.hasNext()) {
neighborList.add(values.next());
}
values.next() will return the same object reference, but the underlying contents of that object will change for each iteration (the readFields method is called to re-populate the contents)
Suggest you amend to (you'll need to obtain the Configuration conf variable from a setup method, unless you can obtain it from the Reporter or OutputCollector - sorry i don't use the old API)
while(values.hasNext()) {
neighborList.add(
ReflectionUtils.copy(conf, values.next(), new NeighborWritable());
}
But I still can't understand why my unit test passed then. Here is the code -
public class UWLTInitReducerTest {
private Text key;
private Iterator<NeighborWritable> values;
private NeighborArrayWritable nodeData;
private TTReducer reducer;
/**
* Set up the states for calling the map function
*/
#Before
public void setUp() throws Exception {
key = new Text("1001");
NeighborWritable[] neighbors = new NeighborWritable[4];
for (int i = 0; i < 4; i++) {
neighbors[i] = new NeighborWritable(new Text("300" + i), new DoubleWritable((double) 1 / (1 + i)));
}
values = Arrays.asList(neighbors).iterator();
nodeData = new NeighborArrayWritable(neighbors);
reducer = new TTReducer();
}
/**
* Test method for InitModelMapper#map - valid input
*/
#Test
public void testMapValid() {
// mock the output object
OutputCollector<Text, UWLTNodeData> output = mock(OutputCollector.class);
try {
// call the API
reducer.reduce(key, values, output, null);
// in order (sequential) verification of the calls to output.collect()
verify(output).collect(key, nodeData);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
Why didn't this code catch the bug?

Using a custom Object as key emitted by mapper

I have a situation in which mapper emits as key an object of custom type.
It has two fields an intWritable ID, and a data array IntArrayWritable.
The implementation is as follows.
`
import java.io.*;
import org.apache.hadoop.io.*;
public class PairDocIdPerm implements WritableComparable<PairDocIdPerm> {
public PairDocIdPerm(){
this.permId = new IntWritable(-1);
this.SignaturePerm = new IntArrayWritable();
}
public IntWritable getPermId() {
return permId;
}
public void setPermId(IntWritable permId) {
this.permId = permId;
}
public IntArrayWritable getSignaturePerm() {
return SignaturePerm;
}
public void setSignaturePerm(IntArrayWritable signaturePerm) {
SignaturePerm = signaturePerm;
}
private IntWritable permId;
private IntArrayWritable SignaturePerm;
public PairDocIdPerm(IntWritable permId,IntArrayWritable SignaturePerm) {
this.permId = permId;
this.SignaturePerm = SignaturePerm;
}
#Override
public void write(DataOutput out) throws IOException {
permId.write(out);
SignaturePerm.write(out);
}
#Override
public void readFields(DataInput in) throws IOException {
permId.readFields(in);
SignaturePerm.readFields(in);
}
#Override
public int hashCode() { // same permId must go to same reducer. there fore just permId
return permId.get();//.hashCode();
}
#Override
public boolean equals(Object o) {
if (o instanceof PairDocIdPerm) {
PairDocIdPerm tp = (PairDocIdPerm) o;
return permId.equals(tp.permId) && SignaturePerm.equals(tp.SignaturePerm);
}
return false;
}
#Override
public String toString() {
return permId + "\t" +SignaturePerm.toString();
}
#Override
public int compareTo(PairDocIdPerm tp) {
int cmp = permId.compareTo(tp.permId);
Writable[] ar, other;
ar = this.SignaturePerm.get();
other = tp.SignaturePerm.get();
if (cmp == 0) {
for(int i=0;i<ar.length;i++){
if(((IntWritable)ar[i]).get() == ((IntWritable)other[i]).get()){cmp= 0;continue;}
else if(((IntWritable)ar[i]).get() < ((IntWritable)other[i]).get()){ return -1;}
else if(((IntWritable)ar[i]).get() > ((IntWritable)other[i]).get()){return 1;}
}
}
return cmp;
//return 1;
}
}`
I require the keys with same Id to go to the same reducer with their sort order as coded in the compareTo method.
However when i use this, my job execution status is always map100% reduce 0%.
The reduce never runs to completion. Is there any thing wrong in this implementation?
In general what is the likely problem if reducer status is always 0%.
I think this might be a possible null pointer exception in the read method:
#Override
public void readFields(DataInput in) throws IOException {
permId.readFields(in);
SignaturePerm.readFields(in);
}
permId is null in this case.
So what you have to do is this:
IntWritable permId = new IntWritable();
Either in the field initializer or before the read.
However, your code is horrible to read.

Resources