I have a simple Storm topology which reads the data from Kafka, parses and extracts message fields. I would like to filter the stream of tuples by one of the fields values and perform counting aggregation on another one. How can I do this in Storm?
I haven't found respective methods for tuples (filter, aggregate) so should I perform these functions directly on the field values?
Here is a topology:
topologyBuilder.setSpout("kafka_spout", new KafkaSpout(spoutConfig), 1)
topologyBuilder.setBolt("parser_bolt", new ParserBolt()).shuffleGrouping("kafka_spout")
topologyBuilder.setBolt("transformer_bolt", new KafkaTwitterBolt()).shuffleGrouping("parser_bolt")
val config = new Config()
cluster.submitTopology("kafkaTest", config, topologyBuilder.createTopology())
I have set up KafkaTwitterBolt to count and filter on the parsed fields. So far I have only managed to filter the whole list of values, not a specific field:
class KafkaTwitterBolt() extends BaseBasicBolt{
override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
val tweetValues = input.getValues.asScala.toList
val filterTweets = tweetValues
.map(_.toString)
.filter(_ contains "big data")
val resultAllValues = new Values(filterTweets)
collector.emit(resultAllValues)
}
override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = {
declarer.declare(new Fields("created_at", "id", "text", "source", "timestamp_ms",
"user.id", "user.name", "user.location", "user.url", "user.description", "user.followers_count",
"user.friends_count", "user.lang", "user.favorite_count", "entities.hashtags"))
}
}
It turns out the Storm core API does not allow that; to filter on a specific field, Trident should be used (it has a built-in filter function).
The code would look like this:
val tridentTopology = new TridentTopology()
val stream = tridentTopology.newStream("kafka_spout",
new KafkaTridentSpoutOpaque(spoutConfig))
.map(new ParserMapFunction, new Fields("created_at", "id", "text", "source", "timestamp_ms",
"user.id", "user.name", "user.location", "user.url", "user.description", "user.followers_count",
"user.friends_count", "user.favorite_count", "user.lang", "entities.hashtags"))
.filter(new LanguageFilter)
Filtering function itself:
class LanguageFilter extends BaseFilter{
override def isKeep(tuple: TridentTuple): Boolean = {
val language = tuple.getStringByField("user.lang")
println(s"TWEET: $language")
language.contains("en")
}
}
Your answer at https://stackoverflow.com/a/59805582/8845188 is a little wrong. The Storm core API does allow filtering and aggregation, you just have to write the logic yourself.
A filtering bolt is just a bolt that discards some tuples, and passes others on. For instance, the following bolt will filter out tuples based on a string field:
class FilteringBolt() extends BaseBasicBolt{
override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
// Keep the java.util.List returned by getValues: it supports get(0) and is what emit() expects
val values = input.getValues
if ("Pass me".equals(values.get(0))) {
collector.emit(values)
}
//Emitting nothing means discarding the tuple
}
override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = {
declarer.declare(new Fields("some-field"))
}
}
An aggregating bolt is just a bolt that collects multiple tuples, and then emits a new aggregate tuple anchored in the original tuples:
class AggregatingBolt extends BaseRichBolt {
  // `collector` is the OutputCollector saved in prepare(); prepare() and declareOutputFields() are omitted here
  val tuplesToAggregate: java.util.List[Tuple] = ...
  var counter = 0
  override def execute(input: Tuple): Unit = {
    tuplesToAggregate.add(input)
    counter += 1
    if (counter == 10) {
      val aggregateTuple: Values = ... // create a new set of values based on tuplesToAggregate
      // This anchors the new aggregate tuple to all the original tuples, so if the aggregate fails, the original tuples are replayed.
      collector.emit(tuplesToAggregate, aggregateTuple)
      // Ack the original tuples now that this bolt is done with them.
      // Note that you MUST emit before you ack, or the at-least-once guarantee will be broken.
      tuplesToAggregate.forEach(t => collector.ack(t))
      tuplesToAggregate.clear()
      counter = 0
    }
    // Note that we don't ack the input tuples until the aggregate gets emitted. This lets us replay all the aggregated tuples in case the aggregate fails.
  }
}
Note that for aggregation, you will need to extend BaseRichBolt and do acking manually, since you want to delay acking a tuple until it has been included in an aggregate tuple.
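For the specific case in the question (filter on one field, count on another), the per-tuple logic that such a bolt would hold can be sketched without any Storm classes. This is only a library-free illustration in plain Java: the class FilteredFieldCounter and its methods are hypothetical, a tuple is modelled as a simple map, and the field names "user.lang" and "user.location" are borrowed from the question.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: the state and per-tuple logic a filtering/counting bolt could keep.
class FilteredFieldCounter {
    private final String filterField;
    private final String filterValue;
    private final String countField;
    private final Map<String, Long> counts = new HashMap<>();

    FilteredFieldCounter(String filterField, String filterValue, String countField) {
        this.filterField = filterField;
        this.filterValue = filterValue;
        this.countField = countField;
    }

    // Returns true if the tuple passed the filter and was counted.
    boolean accept(Map<String, String> tuple) {
        if (!filterValue.equals(tuple.get(filterField))) {
            return false; // in a bolt: emit nothing, which discards the tuple
        }
        String key = tuple.get(countField);
        if (key == null) {
            return false; // nothing to count on
        }
        counts.merge(key, 1L, Long::sum); // running count per distinct field value
        return true;
    }

    long countFor(String value) {
        return counts.getOrDefault(value, 0L);
    }
}
```

If the counting bolt runs with more than one task, the stream feeding it would presumably need a fieldsGrouping on the counted field, so that all tuples with the same value reach the same task.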
I have a class with more than 30 observable attributes. Each time my server receives a payload containing these 30 attributes I call the next() method for all the corresponding attributes of the instance, so far so good.
The problem is that, sometimes, I have to check for an attribute's value, outside the scope of the observer that subscribed to that observable attribute.
What comes to mind is that I have to have duplicate attributes for everything, one is the observable and the other one is a stateful attribute to save the arriving values for later consumption.
Is there some way to avoid this with a method like: Observable.getCurrentValue()?
As requested, some example code
class Example {
public subjects = {
a1: new Subject<any>(),
a2: new Subject<any>(),
a3: new Subject<any>(),
a4: new Subject<any>(),
a5: new Subject<any>()
}
public treatPayload(data: any) {
for (const prop in data) {
if (data.hasOwnProperty(prop) && prop in this.subjects){
Reflect.get(this.subjects, prop).next(data[prop])
}
}
}
public test() {
const a1_observable = this.subjects.a1.asObservable()
const a2_observable = this.subjects.a2.asObservable()
const example_payload_1 = {
a1: "first",
a2: "second",
a10: "useless"
}
const example_payload_2 = {
a1: "first-second",
a2: "second-second",
a10: "useless-second"
}
a1_observable.subscribe((a1_new_value: any) => {
const i_also_want_the_last_value_emitted_by_a2 = a2_observable.last_value() // of course, this doesn't exist
console.log(a1_new_value)
console.log(i_also_want_the_last_value_emitted_by_a2)
})
this.treatPayload(example_payload_1)
this.treatPayload(example_payload_2)
}
}
So, is there a way to retrieve the correct value of i_also_want_the_last_value_emitted_by_a2 without a pipe operator? I think it would be a problem to emit all values I could possibly use in a subscriber within a pipe of the a2_observable.
You could use a BehaviorSubject instead of a Subject: it stores the most recently emitted value, which you can read synchronously at any time through BehaviorSubject.value.
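To show the idea independently of RxJS, here is a minimal, library-free sketch in plain Java of a subject that remembers its latest value. The class LatestValueSubject is hypothetical and only mimics the two relevant behaviours: it replays the current value to new subscribers and exposes that value through a getter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a BehaviorSubject-like object.
class LatestValueSubject<T> {
    private T value;
    private final List<Consumer<T>> subscribers = new ArrayList<>();

    LatestValueSubject(T initialValue) {
        this.value = initialValue; // an initial value is required, so getValue() always has an answer
    }

    void next(T newValue) {
        value = newValue; // remember first, then notify
        for (Consumer<T> subscriber : subscribers) {
            subscriber.accept(newValue);
        }
    }

    void subscribe(Consumer<T> subscriber) {
        subscribers.add(subscriber);
        subscriber.accept(value); // replay the current value to the new subscriber
    }

    T getValue() {
        return value;
    }
}
```

Inside the a1 subscriber, the last value of a2 would then be read with a2.getValue(). Note that this returns whatever a2 held at that moment: with the loop in treatPayload, a1 is pushed before a2, so the a1 subscriber would still see the a2 value from the previous payload.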
I have processed a "wrapperObject" (AimResponse in this case).
Depending on the property "type" I map to Document or SourceSpace object.
Then I need to persist these entities. I found an example similar to this one:
@Override
public void write(List<? extends List<AimResponse>> list)
throws Exception {
List<SourceSpace> sourceSpaces = new ArrayList<>();
List<Document> documents = new ArrayList<>();
for(List<AimResponse> item:list) {
for(AimResponse i:item) {
if(i.getType().indexOf("folder") >= 0) {
SourceSpace sourceSpace = Mapper.aimResponseToSourceSpace(i);
sourceSpace.setStatus(Status.FOUND.name());
sourceSpaces.add(sourceSpace);
} else if(i.getType().indexOf("document") >= 0) {
Document document = Mapper.aimResponseToDocument(i);
document.setStatus(Status.FOUND.name());
documents.add(document);
}
}
}
if(!CollectionUtils.isEmpty(sourceSpaces)) {
sourceSpaceWriter.write(sourceSpaces);
}
if(!CollectionUtils.isEmpty(documents)) {
documentWriter.write(documents);
}
}
In this example I am not able to instantiate JdbcBatchItemWriter. In any case, I think it would be better if the processor could split the items into two lists and call two different writers, each with its own type, but I guess that is not possible.
Any help is appreciated.
ClassifierCompositeItemWriter is what you are looking for. It allows you to classify items according to a given criteria and call the corresponding writer.
In your case, you can classify items based on their type (i.getType()) and use a writer for each type. You can find an example of how to use that writer here.
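The classify-and-route idea can be illustrated without Spring Batch. The following is a library-free sketch in plain Java, not the actual ClassifierCompositeItemWriter: the class ClassifyingWriter is hypothetical, writers are modelled as simple consumers of a list, and the classifier is a function from an item to a type name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical helper: groups items by type and hands each group to the writer registered for it.
class ClassifyingWriter<T> {
    private final Function<T, String> classifier;
    private final Map<String, Consumer<List<T>>> writers;

    ClassifyingWriter(Function<T, String> classifier, Map<String, Consumer<List<T>>> writers) {
        this.classifier = classifier;
        this.writers = writers;
    }

    void write(List<T> items) {
        Map<String, List<T>> groups = new LinkedHashMap<>();
        for (T item : items) {
            String type = classifier.apply(item);
            if (!writers.containsKey(type)) {
                continue; // no writer for this type: skipped, like the if/else-if in the question
            }
            groups.computeIfAbsent(type, k -> new ArrayList<>()).add(item);
        }
        // Each delegate writer is called once, with only the items of its own type
        for (Map.Entry<String, List<T>> group : groups.entrySet()) {
            writers.get(group.getKey()).accept(group.getValue());
        }
    }
}
```

With this shape, the processor can keep returning single items, and each delegate writer only ever sees one type.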
I am trying to make a local aggregation.
The input topic has records containing multiple elements and I am using flatmap to split the record into multiple records with another key (here element_id). This triggers a re-partition as I am applying a grouping for aggregation later in the stream process.
Problem: there are way too many records in this repartition topic and the app cannot handle them (lag is increasing).
Here is an example of the incoming data:
key: another ID
value:
{
"cat_1": {
"element_1" : 0,
"element_2" : 1,
"element_3" : 0
},
"cat_2": {
"element_1" : 0,
"element_2" : 1,
"element_3" : 1
}
}
And an example of the wanted aggregation result:
key : element_2
value:
{
"cat_1": 1,
"cat_2": 1
}
So I would like to make a first "local aggregation" and stop splitting incoming records, meaning that I want to aggregate all elements locally (no re-partition) for example in a 30 seconds window, then produce result per element in a topic. A stream consuming this topic later aggregates at a higher level.
I am using the Streams DSL, but I am not sure it is enough. I tried the process() and transform() methods, which let me use the Processor API, but I don't know how to properly produce records in a punctuation, or how to put records back into a stream.
How could I achieve that? Thank you.
transform() returns a KStream on which you can call to() to write the results into a topic.
stream.transform(...).to("output_topic");
In a punctuation you can call context.forward() to send a record downstream. You still need to call to() to write the forwarded record into a topic.
To implement a custom aggregation consider the following pseudo-ish code:
builder = new StreamsBuilder();
final StoreBuilder<KeyValueStore<Integer, Integer>> keyValueStoreBuilder =
Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore(stateStoreName),
Serdes.Integer(),
Serdes.Integer());
builder.addStateStore(keyValueStoreBuilder);
stream = builder.stream(topic, Consumed.with(Serdes.Integer(), Serdes.Integer()));
stream.transform(() ->
new Transformer<Integer, Integer, KeyValue<Integer, Integer>>() {
private KeyValueStore<Integer, Integer> state;
@Override
public void init(final ProcessorContext context) {
state = (KeyValueStore<Integer, Integer>) context.getStateStore(stateStoreName);
context.schedule(
Duration.ofMinutes(1),
PunctuationType.STREAM_TIME,
timestamp -> {
// You can get aggregates from the state store here
// Then you can send the aggregates downstream
// with context.forward();
// Alternatively, you can output the aggregate in the
// transform() method as shown below
}
);
}
@Override
public KeyValue<Integer, Integer> transform(final Integer key, final Integer value) {
// Get existing aggregates from the state store with state.get().
// Update aggregates and write them into the state store with state.put().
// Depending on some condition, e.g., 10 seen records,
// output an aggregate downstream by returning the output.
// You can output multiple aggregates by using KStream#flatTransform().
// Alternatively, you can output the aggregate in a
// punctuation as shown above
}
@Override
public void close() {
}
}, stateStoreName)
With this manual aggregation you could implement the higher level aggregation in the same streams app and leverage re-partitioning.
process() is a terminal operation, i.e., it does not return anything.
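The state handling for the local pre-aggregation described in the question could look roughly like the following. This is a library-free sketch in plain Java under stated assumptions: the class LocalElementAggregator is hypothetical, a plain map stands in for the state store, add() plays the role of transform(), and flush() plays the role of the punctuation that forwards one aggregate per element.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: accumulates per-element, per-category sums without re-keying each input record.
class LocalElementAggregator {
    // element -> (category -> running sum); stands in for the state store
    private final Map<String, Map<String, Integer>> store = new HashMap<>();

    // Role of transform(): update the state, forward nothing yet.
    void add(Map<String, Map<String, Integer>> recordValue) {
        for (Map.Entry<String, Map<String, Integer>> category : recordValue.entrySet()) {
            for (Map.Entry<String, Integer> element : category.getValue().entrySet()) {
                store.computeIfAbsent(element.getKey(), k -> new TreeMap<>())
                     .merge(category.getKey(), element.getValue(), Integer::sum);
            }
        }
    }

    // Role of the punctuation: return one aggregate per element, then reset the state.
    Map<String, Map<String, Integer>> flush() {
        Map<String, Map<String, Integer>> out = new HashMap<>(store);
        store.clear();
        return out;
    }
}
```

Each entry returned by flush() corresponds to one record that would be forwarded downstream, so the repartition topic receives one record per element and interval instead of one per element and input record.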
My requirement is to skip or avoid duplicate messages(having same key) received from INPUT Topic using kafka stream DSL API.
There is possibility of source system sending duplicate messages to INPUT topic in case of any failures.
FLOW -
Source System --> INPUT Topic --> Kafka Streaming --> OUTPUT Topic
Currently I am using flatMap to generate multiple keys out the payload but flatMap is stateless so not able to avoid duplicate message processing upon receiving from INPUT Topic.
I am looking for DSL API which can skip duplicate records received from INPUT Topic and also generate multiple key/values before sending to OUTPUT Topic.
Thought Exactly Once configuration will be useful here to deduplicate messages received from INPUT Topic based on keys but looks like its not working, probably I did not understand usage of Exactly Once.
Could you please shed some light on it?
My requirement is to skip or avoid duplicate messages(having same key) received from INPUT Topic using kafka stream DSL API.
Take a look at the EventDeduplication example at https://github.com/confluentinc/kafka-streams-examples, which does that. You can then adapt the example with the required flatMap functionality that is specific to your use case.
Here's the gist of the example:
final KStream<byte[], String> input = builder.stream(inputTopic);
final KStream<byte[], String> deduplicated = input.transform(
// In this example, we assume that the record value as-is represents a unique event ID by
// which we can perform de-duplication. If your records are different, adapt the extractor
// function as needed.
() -> new DeduplicationTransformer<>(windowSize.toMillis(), (key, value) -> value),
storeName);
deduplicated.to(outputTopic);
and
/**
* @param maintainDurationPerEventInMs how long to "remember" a known event (or rather, an event
* ID), during the time of which any incoming duplicates of
* the event will be dropped, thereby de-duplicating the
* input.
* @param idExtractor extracts a unique identifier from a record by which we de-duplicate input
* records; if it returns null, the record will not be considered for
* de-duping but forwarded as-is.
*/
DeduplicationTransformer(final long maintainDurationPerEventInMs, final KeyValueMapper<K, V, E> idExtractor) {
if (maintainDurationPerEventInMs < 1) {
throw new IllegalArgumentException("maintain duration per event must be >= 1");
}
leftDurationMs = maintainDurationPerEventInMs / 2;
rightDurationMs = maintainDurationPerEventInMs - leftDurationMs;
this.idExtractor = idExtractor;
}
@Override
@SuppressWarnings("unchecked")
public void init(final ProcessorContext context) {
this.context = context;
eventIdStore = (WindowStore<E, Long>) context.getStateStore(storeName);
}
public KeyValue<K, V> transform(final K key, final V value) {
final E eventId = idExtractor.apply(key, value);
if (eventId == null) {
return KeyValue.pair(key, value);
} else {
final KeyValue<K, V> output;
if (isDuplicate(eventId)) {
output = null;
updateTimestampOfExistingEventToPreventExpiry(eventId, context.timestamp());
} else {
output = KeyValue.pair(key, value);
rememberNewEvent(eventId, context.timestamp());
}
return output;
}
}
private boolean isDuplicate(final E eventId) {
final long eventTime = context.timestamp();
final WindowStoreIterator<Long> timeIterator = eventIdStore.fetch(
eventId,
eventTime - leftDurationMs,
eventTime + rightDurationMs);
final boolean isDuplicate = timeIterator.hasNext();
timeIterator.close();
return isDuplicate;
}
private void updateTimestampOfExistingEventToPreventExpiry(final E eventId, final long newTimestamp) {
eventIdStore.put(eventId, newTimestamp, newTimestamp);
}
private void rememberNewEvent(final E eventId, final long timestamp) {
eventIdStore.put(eventId, timestamp, timestamp);
}
@Override
public void close() {
// Note: The store should NOT be closed manually here via `eventIdStore.close()`!
// The Kafka Streams API will automatically close stores when necessary.
}
}
I am looking for DSL API which can skip duplicate records received from INPUT Topic and also generate multiple key/values before sending to OUTPUT Topic.
The DSL doesn't include such functionality out of the box, but the example above shows how you can easily build your own de-duplication logic by combining the DSL with the Processor API of Kafka Streams, with the use of Transformers.
Thought Exactly Once configuration will be useful here to deduplicate messages received from INPUT Topic based on keys but looks like its not working, probably I did not understand usage of Exactly Once.
As Matthias J. Sax mentioned in his answer, these "duplicates" are not duplicates from the point of view of Kafka's exactly-once processing semantics. Kafka ensures that it will not introduce any such duplicates itself, but it cannot make such decisions out of the box for upstream data sources, which are a black box to Kafka.
Exactly-once can be used to ensure that consuming and processing an input topic does not result in duplicates in the output topic. However, from an exactly-once point of view, the duplicates in the input topic that you describe are not really duplicates but two regular input messages.
To remove input topic duplicates, you can use a transform() step with an attached state store (there is no built-in operator in the DSL that does what you want). For each input record, you first check whether the corresponding key is in the store. If not, you add it to the store and forward the message. If you find it in the store, you drop the input as a duplicate. Note that this only works with a 100% correctness guarantee if you enable exactly-once processing in your Kafka Streams application. Otherwise, even if you try to deduplicate, Kafka Streams could re-introduce duplicates in case of a failure.
Additionally, you need to decide how long you want to keep entries in the store. You could use a Punctuation to remove old data from the store once you are sure that no further duplicates can be in the input topic. One way to do this would be to store the record timestamp (or maybe the offset) in the store, too. This way, you can compare the current time with the stored record time within punctuate() and delete old records (i.e., you would iterate over all entries in the store via store#all()).
After the transform() you apply your flatMap() (or you could merge your flatMap() code into transform() directly).
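The check-then-store logic and the clean-up step can be sketched without the Kafka Streams API. The following plain Java is only an illustration under stated assumptions: the class KeyDeduplicator is hypothetical, a map stands in for the state store, shouldForward() corresponds to the check done in transform(), and expire() corresponds to the work done in a punctuation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: remembers the first timestamp seen per key and drops repeats.
class KeyDeduplicator {
    private final long retentionMs;
    private final Map<String, Long> firstSeen = new HashMap<>(); // stands in for the state store

    KeyDeduplicator(long retentionMs) {
        this.retentionMs = retentionMs;
    }

    // Role of transform(): true means forward the record, false means drop it as a duplicate.
    boolean shouldForward(String key, long timestampMs) {
        return firstSeen.putIfAbsent(key, timestampMs) == null;
    }

    // Role of the punctuation: forget keys whose stored timestamp is older than the retention time.
    void expire(long nowMs) {
        firstSeen.values().removeIf(storedTs -> nowMs - storedTs > retentionMs);
    }
}
```

Once a key has been expired it is treated as new again, so the retention time should be longer than the longest delay you expect between an original record and its duplicate.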
It is also achievable with the DSL alone, using a SessionWindows changelog with caching disabled:
Wrap the value together with a duplicate flag
Set the flag to true in reduce() within the time window
Filter out the values whose flag is true
Unwrap the original key and value
Topology:
Serde<K> keySerde = ...;
Serde<V> valueSerde = ...;
Duration dedupWindowSize = ...;
Duration gracePeriod = ...;
DedupValueSerde<V> dedupValueSerde = new DedupValueSerde<>(valueSerde);
new StreamsBuilder()
.stream("input-topic", Consumed.with(keySerde, valueSerde))
.mapValues(v -> new DedupValue<>(v, false))
.groupByKey()
.windowedBy(SessionWindows.ofInactivityGapAndGrace(dedupWindowSize, gracePeriod))
.reduce(
(value1, value2) -> new DedupValue<>(value1.value(), true),
Materialized
.<K, DedupValue<V>, SessionStore<Bytes, byte[]>>with(keySerde, dedupValueSerde)
.withCachingDisabled()
)
.toStream()
.filterNot((wk, dv) -> dv == null || dv.duplicate())
.selectKey((wk, dv) -> wk.key())
.mapValues(DedupValue::value)
.to("output-topic", Produced.with(keySerde, valueSerde));
Value wrapper:
record DedupValue<V>(V value, boolean duplicate) { }
Value wrapper SerDe (example):
public class DedupValueSerde<V> extends WrapperSerde<DedupValue<V>> {
public DedupValueSerde(Serde<V> vSerde) {
super(new DvSerializer<>(vSerde.serializer()), new DvDeserializer<>(vSerde.deserializer()));
}
private record DvSerializer<V>(Serializer<V> vSerializer) implements Serializer<DedupValue<V>> {
@Override
public byte[] serialize(String topic, DedupValue<V> data) {
byte[] vBytes = vSerializer.serialize(topic, data.value());
return ByteBuffer
.allocate(vBytes.length + 1)
.put(data.duplicate() ? (byte) 1 : (byte) 0)
.put(vBytes)
.array();
}
}
private record DvDeserializer<V>(Deserializer<V> vDeserializer) implements Deserializer<DedupValue<V>> {
@Override
public DedupValue<V> deserialize(String topic, byte[] data) {
ByteBuffer buffer = ByteBuffer.wrap(data);
boolean duplicate = buffer.get() == (byte) 1;
int remainingSize = buffer.remaining();
byte[] vBytes = new byte[remainingSize];
buffer.get(vBytes);
V value = vDeserializer.deserialize(topic, vBytes);
return new DedupValue<>(value, duplicate);
}
}
}
I have a delegate that takes two numbers and creates a System.Windows.Point from them:
(x, y) => new Point(x,y);
I want to learn how can I use TPL Dataflow, specifically TransformBlock, to perform that.
I would have something like this:
ISourceBlock<double> Xsource;
ISourceBlock<double> Ysource;
ITargetBlock<Point> PointTarget;
// is there such a thing?
TransformBlock<double, double, Point> PointCreatorBlock;
// and also, how should I wire them together?
UPDATE:
Also, how can I assemble a network that joins more than two arguments? For example, let's say I have a method that receives eight arguments, each one coming from a different buffer, how can I create a block that knows when every argument has one instance available so that the object can be created?
I think what you're looking for is the JoinBlock. There are currently two-input and three-input variants, each of which outputs a tuple. These can be combined to build an eight-parameter result. Another approach is to create a class that holds the parameters and use various blocks to process and construct it.
For the simple example of combining two ints into a point:
class MyClass {
BufferBlock<int> Xsource;
BufferBlock<int> Ysource;
JoinBlock<int, int> pointValueSource;
TransformBlock<Tuple<int, int>, Point> pointProducer;
public MyClass() {
CreatePipeline();
LinkPipeline();
}
private void CreatePipeline() {
Xsource = new BufferBlock<int>();
Ysource = new BufferBlock<int>();
pointValueSource = new JoinBlock<int, int>(new GroupingDataflowBlockOptions() {
Greedy = false
});
pointProducer = new TransformBlock<Tuple<int, int>, Point>((Func<Tuple<int,int>,Point>)ProducePoint,
new ExecutionDataflowBlockOptions()
{ MaxDegreeOfParallelism = Environment.ProcessorCount });
}
private void LinkPipeline() {
Xsource.LinkTo(pointValueSource.Target1, new DataflowLinkOptions() {
PropagateCompletion = true
});
Ysource.LinkTo(pointValueSource.Target2, new DataflowLinkOptions() {
PropagateCompletion = true
});
pointValueSource.LinkTo(pointProducer, new DataflowLinkOptions() {
PropagateCompletion = true
});
//pointProducer.LinkTo(Next Step In processing)
}
private Point ProducePoint(Tuple<int, int> XandY) {
return new Point(XandY.Item1, XandY.Item2);
}
}
The JoinBlock waits until data is available on both of its input buffers before it produces an output. Also note that if the X and Y values arrive out of order at the input buffers, care needs to be taken to re-sync them: the join block simply combines the first X with the first Y it receives, the second X with the second Y, and so on.
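The pairing rule is easy to see in a small model. The following is a library-free sketch in plain Java, not TPL Dataflow itself: the class PairJoiner is hypothetical and only mimics the "first X with first Y" behaviour using two queues.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model of a two-input join: pairs values strictly by arrival order on each input.
class PairJoiner {
    private final Deque<Integer> xs = new ArrayDeque<>();
    private final Deque<Integer> ys = new ArrayDeque<>();
    private final List<int[]> pairs = new ArrayList<>();

    void offerX(int x) {
        xs.addLast(x);
        drain();
    }

    void offerY(int y) {
        ys.addLast(y);
        drain();
    }

    // Emit a pair only while both inputs have at least one value waiting.
    private void drain() {
        while (!xs.isEmpty() && !ys.isEmpty()) {
            pairs.add(new int[] { xs.removeFirst(), ys.removeFirst() });
        }
    }

    List<int[]> pairs() {
        return pairs;
    }
}
```

Nothing in this model ties a particular X to a particular Y. If the two sources can get out of step, each value would need to carry a correlation key so the pairs can be matched explicitly.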