kafka stream make a local aggregation - apache-kafka-streams

I am trying to make a local aggregation.
The input topic has records containing multiple elements and I am using flatmap to split the record into multiple records with another key (here element_id). This triggers a re-partition as I am applying a grouping for aggregation later in the stream process.
Problem: there are way too many records in this repartition topic and the app cannot handle them (lag is increasing).
Here is a example of the incoming data
key: another ID
value:
{
"cat_1": {
"element_1" : 0,
"element_2" : 1,
"element_3" : 0
},
"cat_2": {
"element_1" : 0,
"element_2" : 1,
"element_3" : 1
}
}
And an example of the wanted aggregation result:
key : element_2
value:
{
"cat_1": 1,
"cat_2": 1
}
So I would like to make a first "local aggregation" and stop splitting incoming records, meaning that I want to aggregate all elements locally (no re-partition) for example in a 30 seconds window, then produce result per element in a topic. A stream consuming this topic later aggregates at a higher level.
I am using Stream DSL, but I am not sure it is enough. I tried to use the process() and transform() methods that allow me to benefit from the Processor API, but I don't known how to properly produce some records in a punctuation, or put records in a stream.
How could I achieve that ? Thank you

transform() returns a KStream on which you can call to() to write the results into a topic.
stream.transform(...).to("output_topic");
In a punctuation you can call context.forward() to send a record downstream. You still need to call to() to write the forwarded record into a topic.
To implement a custom aggregation consider the following pseudo-ish code:
builder = new StreamsBuilder();
final StoreBuilder<KeyValueStore<Integer, Integer>> keyValueStoreBuilder =
Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore(stateStoreName),
Serdes.Integer(),
Serdes.Integer());
builder.addStateStore(keyValueStoreBuilder);
stream = builder.stream(topic, Consumed.with(Serdes.Integer(), Serdes.Integer()));
stream.transform(() ->
new Transformer<Integer, Integer, KeyValue<Integer, Integer>>() {
private KeyValueStore<Integer, Integer> state;
#Override
public void init(final ProcessorContext context) {
state = (KeyValueStore<Integer, Integer>) context.getStateStore(stateStoreName);
context.schedule(
Duration.ofMinutes(1),
PunctuationType.STREAM_TIME,
timestamp -> {
// You can get aggregates from the state store here
// Then you can send the aggregates downstream
// with context.forward();
// Alternatively, you can output the aggregate in the
// transform() method as shown below
}
);
}
#Override
public KeyValue<Integer, Integer> transform(final Integer key, final Integer value) {
// Get existing aggregates from the state store with state.get().
// Update aggregates and write them into the state store with state.put().
// Depending on some condition, e.g., 10 seen records,
// output an aggregate downstream by returning the output.
// You can output multiple aggregates by using KStream#flatTransform().
// Alternatively, you can output the aggregate in a
// punctuation as shown above
}
#Override
public void close() {
}
}, stateStoreName)
With this manual aggregation you could implement the higher level aggregation in the same streams app and leverage re-partitioning.
process() is a terminal operation, i.e., it does not return anything.

Related

Collect Uni.combine failures

Is there a way to collect Uni.combine().all().unis(...) failures, as Uni.join().all(...).andCollectFailures() does?
I need to call different services concurrently (with heterogeneous results) and fail all if one of them fails.
Moreover, what's the difference between Uni.combine().all().unis(...) and Uni.join(...) ?
Uni Combine Exception
The code should be like this,
return Uni.combine().all().unis(getObject1(), getObject2()).collectFailures().asTuple().flatMap(tuples -> {
return Uni.createFrom().item(Response.ok().build());
}).onFailure().recoverWithUni(failures -> {
System.out.println(failures instanceof CompositeException);
CompositeException exception = (CompositeException) failures;
for (Throwable error : exception.getCauses()) {
System.out.println(error.toString());
}
// failures.printStackTrace();
return Uni.createFrom().item(Response.status(500).build());
});
Difference:
Quarkus provides parallel processing through use of these two features:
Unijoin - Iterate through a list of objects and perform a certain
operation in parallel.
Iterate through a list of orders and perform activities on each order one by one (Multi)
OR
Iterate through a list of orders and add a method wrapper on the object in the Unijoin Builder. When the builder executes, the method wrappers are called in parallel, and its response collected in a list.
List<RequestDTO> reqDataList = request.getRequestData(); // Your input data
UniJoin.Builder<ResponseDTO> builder = Uni.join().builder();
for (RequestDTO requestDTO : reqDataList) {
builder.add(process(requestDTO));
}
return builder.joinAll().andFailFast().flatMap(responseList -> {
List<ResponseDTO> nonNullList = new ArrayList<>();
nonNullList.addAll(responseList.stream().filter(respDTO -> {
return respDTO != null;
}).collect(Collectors.toList()));
return Uni.createFrom().item(nonNullList);
});
You can see the list of objects converted to method wrapper, 'process', which is then called in parallel when 'andFailFast' is called.
'Uni.combine' - Call separate methods that
returns different response in parallel.
List<OrderDTO> orders = new ArrayList<>();
return Uni.combine().all().unis(getCountryMasters(),
getCurrencyMasters(updateDto)).asTuple()
.flatMap(tuple -> {
List<CountryDto> countries = tuple.getItem1();
List<CurrencyDto> currencies = tuple.getItem2();
// Get country code and currency code from each order and
// convert it to corresponding technical id.
return convert(orders, countries, currencies);
});
As you see above, both methods in 'combine' returns different results and yet they are performed in parallel.

Query KTable in the same Application where it is created

I have an Kafka streams application in which I read from a topic, do aggregation and materialize in a KTable. I then create a Stream and run some logic on the stream. Now in the stream processing, I want to use some data from the aforementioned KTable. Once I start the stream app, how do I get access to the KTable stream again? I don't want to push the KTable to a new Topic.
KStream<String, MyClass> source = builder.stream("my-topic");
KTable<Windowed<String>, Long> kTable =
source.groupBy((key, value) -> value.getKey(),
Grouped.<String, MyClass >as("repartition-1")
.withKeySerde(new Serdes.String())
.withValueSerde(new MyClassSerDes()))
.windowedBy(TimeWindows.of(Duration.ofSeconds(5)))
.count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("test-store")
.withKeySerde(new Serdes.String())
.withValueSerde(Serdes.Long()));
Here I want to use data from the kTable.
inputstream.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
.count(Materialized.<myKey, Long, WindowStore<Bytes, byte[]>>as("str")
.withRetention(Duration.ofMinutes(30)))
.toStream()
.filter((k, v) -> {
// Here get the count for the previous Window.
// Use that count for some computation here.
}
You can add the KTable store to a processor/transformer. For you case, you can replace the filter with flatTransform (or any sibling like transform etc depending if you need access to the key) and connect the store to the operator:
inputstream.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
.count(Materialized.<myKey, Long, WindowStore<Bytes, byte[]>>as("str")
.withRetention(Duration.ofMinutes(30))
)
.toStream()
// requires v2.2; otherwise use `transform()`
// if you don't need access to the key, consider to use `flatTransformValues` (v2.3)
.flatTransform(
() -> new Transformer<Windowed<myKey>,
Long,
List<KeyValue<Windowed<myKey>, Long>>() {
private ReadOnlyWindowStore<myKey, Long> store;
public void init(final ProcessorContext context) {
// get a handle on the store by its name
// as specified via `Materialized` above;
// should be read-only
store = (ReadOnlyWindowStore<myKey, Long>)context.getStateStore("str");
}
public List<KeyValue<Windowed<myKey>, Long>> transform(Windowed<myKey> key,
Long value) {
// access `store` as you wish to make a filtering decision
if ( ... ) {
// record passes
return Collection.singletonList(KeyValue.pair(key, value));
} else {
// drop record
return Collection.emptyList();
}
}
public void close() {} // nothing to do
},
"str" // connect the KTable store to the transformer using its name
// as specified via `Materialized` above
);

Filtering data bolt Storm

I have a simple Storm topology which reads the data from Kafka, parses and extracts message fields. I would like to filter the stream of tuples by one of the fields values and perform counting aggregation on another one. How can I do this in Storm?
I haven't found respective methods for tuples (filter, aggregate) so should I perform these functions directly on the field values?
Here is a topology:
topologyBuilder.setSpout("kafka_spout", new KafkaSpout(spoutConfig), 1)
topologyBuilder.setBolt("parser_bolt", new ParserBolt()).shuffleGrouping("kafka_spout")
topologyBuilder.setBolt("transformer_bolt", new KafkaTwitterBolt()).shuffleGrouping("parser_bolt")
val config = new Config()
cluster.submitTopology("kafkaTest", config, topologyBuilder.createTopology())
I have set up KafkaTwitterBolt for counting and filtering with parsed fields. I've managed to filter the whole list of values only not by specific field:
class KafkaTwitterBolt() extends BaseBasicBolt{
override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
val tweetValues = input.getValues.asScala.toList
val filterTweets = tweetValues
.map(_.toString)
.filter(_ contains "big data")
val resultAllValues = new Values(filterTweets)
collector.emit(resultAllValues)
}
override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = {
declarer.declare(new Fields("created_at", "id", "text", "source", "timestamp_ms",
"user.id", "user.name", "user.location", "user.url", "user.description", "user.followers_count",
"user.friends_count", "user.lang", "user.favorite_count", "entities.hashtags"))
}
}
Turns out Storm core API does not allows that, in order to perform filtering on any field Trident should be used (it has built-in filter function).
The code would look like this:
val tridentTopology = new TridentTopology()
val stream = tridentTopology.newStream("kafka_spout",
new KafkaTridentSpoutOpaque(spoutConfig))
.map(new ParserMapFunction, new Fields("created_at", "id", "text", "source", "timestamp_ms",
"user.id", "user.name", "user.location", "user.url", "user.description", "user.followers_count",
"user.friends_count", "user.favorite_count", "user.lang", "entities.hashtags"))
.filter(new LanguageFilter)
Filtering function itself:
class LanguageFilter extends BaseFilter{
override def isKeep(tuple: TridentTuple): Boolean = {
val language = tuple.getStringByField("user.lang")
println(s"TWEET: $language")
language.contains("en")
}
}
Your answer at https://stackoverflow.com/a/59805582/8845188 is a little wrong. The Storm core API does allow filtering and aggregation, you just have to write the logic yourself.
A filtering bolt is just a bolt that discards some tuples, and passes others on. For instance, the following bolt will filter out tuples based on a string field:
class FilteringBolt() extends BaseBasicBolt{
override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
val values = input.getValues.asScala.toList
if ("Pass me".equals(values.get(0))) {
collector.emit(values)
}
//Emitting nothing means discarding the tuple
}
override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = {
declarer.declare(new Fields("some-field"))
}
}
An aggregating bolt is just a bolt that collects multiple tuples, and then emits a new aggregate tuple anchored in the original tuples:
class AggregatingBolt extends BaseRichBolt {
List<Tuple> tuplesToAggregate = ...;
int counter = 0;
override def execute(input: Tuple): Unit = {
tuplesToAggregate.add(input);
counter++;
if (counter == 10) {
Values aggregateTuple = ... //create a new set of values based on tuplesToAggregate
collector.emit(tuplesToAggregate, aggregateTuple) //This anchors the new aggregate tuple to all the original tuples, so if the aggregate fails, the original tuples are replayed.
for (Tuple t : tuplesToAggregate) {
collector.ack(t); //Ack the original tuples now that this bolt is done with them
//Note that you MUST emit before you ack, or the at-least-once guarantee will be broken.
}
tuplesToAggregate.clear();
counter = 0;
}
//Note that we don't ack the input tuples until the aggregate gets emitted. This lets us replay all the aggregated tuples in case the aggregate fails
}
}
Note that for aggregation, you will need to extend BaseRichBolt and do acking manually, since you want to delay acking a tuple until it has been included in an aggregate tuple.

Save Multiple Records using Web Flux and Mongo DB

I'm working on a project which uses Spring web Flux and mongo DB and I'm very new to reactive programming and WebFlux.
I have scenario of saving into 3 collections using one service. For each collection im generating id using a sequence and then save them. I have FieldMaster which have List on them and every Field Info has List . I need to save FieldMaster, FileInfo and FieldOption. Below is the Code i'm using. The code works only when i'm running on debugging mode, otherwise it get blocked on below line
Integer field_seq_id = Integer.parseInt(sequencesCollection.getNextSequence(FIELDINFO).block().getSeqValue());
Here is the full code
public Mono< FieldMaster > createMasterData(Mono< FieldMaster > fieldmaster)
{
return fieldmaster.flatMap(fm -> {
return sequencesCollection.getNextSequence(FIELDMASTER).flatMap(seqVal -> {
LOGGER.info("Generated Sequence value :" + seqVal.getSeqValue());
fm.setId(Integer.parseInt(seqVal.getSeqValue()));
List<FieldInfo> fieldInfo = fm.getFieldInfo();
fieldInfo.forEach(field -> {
// saving Field Goes Here
Integer field_seq_id = Integer.parseInt(sequencesCollection.getNextSequence(FIELDINFO).block().getSeqValue()); // stops execution at this line
LOGGER.info("Generated Sequence value Field Sequence:" + field_seq_id);
field.setId(field_seq_id);
field.setMasterFieldRefId(fm.getId());
mongoTemplate.save(field).block();
LOGGER.info("Field Details Saved");
List<FieldOption> fieldOption = field.getFieldOptions();
fieldOption.forEach(option -> {
// saving Field Option Goes Here
Integer opt_seq_id = Integer.parseInt(sequencesCollection.getNextSequence(FIELDOPTION).block().getSeqValue());
LOGGER.info("Generated Sequence value Options Sequence:" + opt_seq_id);
option.setId(opt_seq_id);
option.setFieldRefId(field_seq_id);
mongoTemplate.save(option).log().block();
LOGGER.info("Field Option Details Saved");
});
});
return mongoTemplate.save(fm).log();
});
});
}
First in reactive programming is not good to use .block because you turn nonblocking code to blocking. if you want to get from a stream and save in 3 streams you can do like that.
There are many different ways to do that for performance purpose but it depends of the amount of data.
here you have a sample using simple data and using concat operator but there are even zip and merge. it depends from your needs.
public void run(String... args) throws Exception {
Flux<Integer> dbData = Flux.range(0, 10);
dbData.flatMap(integer -> Flux.concat(saveAllInFirstCollection(integer), saveAllInSecondCollection(integer), saveAllInThirdCollection(integer))).subscribe();
}
Flux<Integer> saveAllInFirstCollection(Integer integer) {
System.out.println(integer);
//process and save in collection
return Flux.just(integer);
}
Flux<Integer> saveAllInSecondCollection(Integer integer) {
System.out.println(integer);
//process and save in collection
return Flux.just(integer);
}
Flux<Integer> saveAllInThirdCollection(Integer integer) {
System.out.println(integer);
//process and save in collection
return Flux.just(integer);
}

How to handle duplicate messages using Kafka streaming DSL functions

My requirement is to skip or avoid duplicate messages(having same key) received from INPUT Topic using kafka stream DSL API.
There is possibility of source system sending duplicate messages to INPUT topic in case of any failures.
FLOW -
Source System --> INPUT Topic --> Kafka Streaming --> OUTPUT Topic
Currently I am using flatMap to generate multiple keys out the payload but flatMap is stateless so not able to avoid duplicate message processing upon receiving from INPUT Topic.
I am looking for DSL API which can skip duplicate records received from INPUT Topic and also generate multiple key/values before sending to OUTPUT Topic.
Thought Exactly Once configuration will be useful here to deduplicate messages received from INPUT Topic based on keys but looks like its not working, probably I did not understand usage of Exactly Once.
Could you please put some light on it.
My requirement is to skip or avoid duplicate messages(having same key) received from INPUT Topic using kafka stream DSL API.
Take a look at the EventDeduplication example at https://github.com/confluentinc/kafka-streams-examples, which does that. You can then adapt the example with the required flatMap functionality that is specific to your use case.
Here's the gist of the example:
final KStream<byte[], String> input = builder.stream(inputTopic);
final KStream<byte[], String> deduplicated = input.transform(
// In this example, we assume that the record value as-is represents a unique event ID by
// which we can perform de-duplication. If your records are different, adapt the extractor
// function as needed.
() -> new DeduplicationTransformer<>(windowSize.toMillis(), (key, value) -> value),
storeName);
deduplicated.to(outputTopic);
and
/**
* #param maintainDurationPerEventInMs how long to "remember" a known event (or rather, an event
* ID), during the time of which any incoming duplicates of
* the event will be dropped, thereby de-duplicating the
* input.
* #param idExtractor extracts a unique identifier from a record by which we de-duplicate input
* records; if it returns null, the record will not be considered for
* de-duping but forwarded as-is.
*/
DeduplicationTransformer(final long maintainDurationPerEventInMs, final KeyValueMapper<K, V, E> idExtractor) {
if (maintainDurationPerEventInMs < 1) {
throw new IllegalArgumentException("maintain duration per event must be >= 1");
}
leftDurationMs = maintainDurationPerEventInMs / 2;
rightDurationMs = maintainDurationPerEventInMs - leftDurationMs;
this.idExtractor = idExtractor;
}
#Override
#SuppressWarnings("unchecked")
public void init(final ProcessorContext context) {
this.context = context;
eventIdStore = (WindowStore<E, Long>) context.getStateStore(storeName);
}
public KeyValue<K, V> transform(final K key, final V value) {
final E eventId = idExtractor.apply(key, value);
if (eventId == null) {
return KeyValue.pair(key, value);
} else {
final KeyValue<K, V> output;
if (isDuplicate(eventId)) {
output = null;
updateTimestampOfExistingEventToPreventExpiry(eventId, context.timestamp());
} else {
output = KeyValue.pair(key, value);
rememberNewEvent(eventId, context.timestamp());
}
return output;
}
}
private boolean isDuplicate(final E eventId) {
final long eventTime = context.timestamp();
final WindowStoreIterator<Long> timeIterator = eventIdStore.fetch(
eventId,
eventTime - leftDurationMs,
eventTime + rightDurationMs);
final boolean isDuplicate = timeIterator.hasNext();
timeIterator.close();
return isDuplicate;
}
private void updateTimestampOfExistingEventToPreventExpiry(final E eventId, final long newTimestamp) {
eventIdStore.put(eventId, newTimestamp, newTimestamp);
}
private void rememberNewEvent(final E eventId, final long timestamp) {
eventIdStore.put(eventId, timestamp, timestamp);
}
#Override
public void close() {
// Note: The store should NOT be closed manually here via `eventIdStore.close()`!
// The Kafka Streams API will automatically close stores when necessary.
}
}
I am looking for DSL API which can skip duplicate records received from INPUT Topic and also generate multiple key/values before sending to OUTPUT Topic.
The DSL doesn't include such functionality out of the box, but the example above shows how you can easily build your own de-duplication logic by combining the DSL with the Processor API of Kafka Streams, with the use of Transformers.
Thought Exactly Once configuration will be useful here to deduplicate messages received from INPUT Topic based on keys but looks like its not working, probably I did not understand usage of Exactly Once.
As Matthias J. Sax mentioned in his answer, from Kafka's perspective these "duplicates" are not duplicates from the point of view of its exactly-once processing semantics. Kafka ensures that it will not introduce any such duplicates itself, but it cannot make such decisions out-of-the-box for upstream data sources, which are black box for Kafka.
Exactly-once can be use to ensure that consuming and processing an input topic, does not result in duplicates in the output topic. However, from an exactly-once point of view, the duplicates in the input topic that you describe are not really duplicates but two regular input messages.
For remove input topic duplicates, you can use a transform() step with an attached state store (there is no built-in operator in the DSL that does what you want). For each input records, you first check if you find the corresponding key in the store. If not, you add it to the store and forward the message. If you find it in the store, you drop the input as duplicate. Note, this will only work with 100% correctness guarantee if you enable exactly-once processing in your Kafka Streams application. Otherwise, even if you try do deduplicate, Kafka Streams could re-introduce duplication in case of a failure.
Additionally, you need to decide how long you want to keep entries in the store. You could use a Punctuation to remove old data from the store if you are sure that no further duplicate can be in the input topic. One way to do this, would be to store the record timestamp (or maybe offset) in the store, too. This way, you can compare the current time with the store record time within punctuate() and delete old records (ie, you would iterator over all entries in the store via store#all()).
After the transform() you apply your flatMap() (or could also merge your flatMap() code into transform() directly.
It's achievable with DSL only as well, using SessionWindows changelog without caching.
Wrap the value with duplicate flag
Turn the flag to true in reduce() within time window
Filter out true flag values
Unwrap the original key and value
Topology:
Serde<K> keySerde = ...;
Serde<V> valueSerde = ...;
Duration dedupWindowSize = ...;
Duration gracePeriod = ...;
DedupValueSerde<V> dedupValueSerde = new DedupValueSerde<>(valueSerde);
new StreamsBuilder()
.stream("input-topic", Consumed.with(keySerde, valueSerde))
.mapValues(v -> new DedupValue<>(v, false))
.groupByKey()
.windowedBy(SessionWindows.ofInactivityGapAndGrace(dedupWindowSize, gracePeriod))
.reduce(
(value1, value2) -> new DedupValue<>(value1.value(), true),
Materialized
.<K, DedupValue<V>, SessionStore<Bytes, byte[]>>with(keySerde, dedupValueSerde)
.withCachingDisabled()
)
.toStream()
.filterNot((wk, dv) -> dv == null || dv.duplicate())
.selectKey((wk, dv) -> wk.key())
.mapValues(DedupValue::value)
.to("output-topic", Produced.with(keySerde, valueSerde));
Value wrapper:
record DedupValue<V>(V value, boolean duplicate) { }
Value wrapper SerDe (example):
public class DedupValueSerde<V> extends WrapperSerde<DedupValue<V>> {
public DedupValueSerde(Serde<V> vSerde) {
super(new DvSerializer<>(vSerde.serializer()), new DvDeserializer<>(vSerde.deserializer()));
}
private record DvSerializer<V>(Serializer<V> vSerializer) implements Serializer<DedupValue<V>> {
#Override
public byte[] serialize(String topic, DedupValue<V> data) {
byte[] vBytes = vSerializer.serialize(topic, data.value());
return ByteBuffer
.allocate(vBytes.length + 1)
.put(data.duplicate() ? (byte) 1 : (byte) 0)
.put(vBytes)
.array();
}
}
private record DvDeserializer<V>(Deserializer<V> vDeserializer) implements Deserializer<DedupValue<V>> {
#Override
public DedupValue<V> deserialize(String topic, byte[] data) {
ByteBuffer buffer = ByteBuffer.wrap(data);
boolean duplicate = buffer.get() == (byte) 1;
int remainingSize = buffer.remaining();
byte[] vBytes = new byte[remainingSize];
buffer.get(vBytes);
V value = vDeserializer.deserialize(topic, vBytes);
return new DedupValue<>(value, duplicate);
}
}
}

Resources