How to perform windowed aggregations on top of windowed aggregations - apache-kafka-streams

Tl;dr:
Why is there no analogue of TimeWindowedKStream for KTable?
How can I do a windowed aggregation of a KTable with the current API?
a) Map Windowed<MyKey> to a new timescale and treat this as a normal aggregation
b) Convert the KTable to a KStream and do the state management manually
c) Something else
d) Please don't
We have an infinite-retention topic in Kafka with "financial transactions", a stream of events that each contain a BigDecimal revenue value and some metadata. To speed up our reporting, we would like to pre-aggregate these revenues over different window durations: hour, day, month, and year.
The aggregation of the events is pretty standard:
.groupBy((key, value) -> new MyKey(value.getA(), value.getB()),
        Grouped.with(myKeySerde, financialTransactionSerde))
.windowedBy(TimeWindows.of(WINDOW_DURATION).grace(GRACE_PERIOD))
.aggregate(BigDecimalSummaryStatistics::new,
        (aggKey, newValue, aggValue) -> {
            aggValue.accept(newValue.getRevenue());
            return aggValue;
        },
        Materialized.<MyKey, BigDecimalSummaryStatistics, WindowStore<Bytes, byte[]>>
                with(myKeySerde, revenueSerde)
                .withRetention(RETENTION_PERIOD))
BigDecimalSummaryStatistics is a custom class analogous to java.util.DoubleSummaryStatistics, because we are also interested in min, max, and count. The key's name is heavily abbreviated here for readability.
(Comments on the code are always appreciated.)
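For reference, a simplified sketch of what that class looks like (a sketch for illustration, not the exact implementation):

import java.math.BigDecimal;
import java.util.function.Consumer;

// Simplified sketch of a BigDecimal analogue of java.util.DoubleSummaryStatistics:
// tracks count, sum, min, and max of the accepted values.
public class BigDecimalSummaryStatistics implements Consumer<BigDecimal> {

    private long count = 0L;
    private BigDecimal sum = BigDecimal.ZERO;
    private BigDecimal min = null;
    private BigDecimal max = null;

    @Override
    public void accept(BigDecimal value) {
        count++;
        sum = sum.add(value);
        min = (min == null) ? value : min.min(value);
        max = (max == null) ? value : max.max(value);
    }

    public long getCount() { return count; }
    public BigDecimal getSum() { return sum; }
    public BigDecimal getMin() { return min; }
    public BigDecimal getMax() { return max; }
}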
The output is obviously a KTable, which we write into a compacted topic. This "hourly-revenue" topic is written with
Produced.with(new WindowedSerdes.TimeWindowedSerde<>(myKeySerde), revenueSerde)
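Spelled out, the write looks roughly like this (a sketch; hourlyRevenueTable is a placeholder name for the windowed KTable materialized above):

// Sketch only: convert the windowed KTable to its changelog stream and write it
// to the compacted "hourly-revenue" topic. "hourlyRevenueTable" is a placeholder.
hourlyRevenueTable
        .toStream()
        .to("hourly-revenue",
                Produced.with(new WindowedSerdes.TimeWindowedSerde<>(myKeySerde), revenueSerde));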
Now to the fun part: I would like to perform the aggregations for coarser timescales on the output of the hourly aggregation. This has several reasons:
Workload: If the topic is already compacted or we decide to use suppression, the number of events to process is far smaller.
Schema: The source topic is quite messed up (think multiple schemas and no schema registry), outside our control to modify, and too large to persist in a "clean" topic. Currently we deserialize it as byte[] and have a huge and costly Transformer at the beginning of the topology that unifies the schema in-place (feedback appreciated). So it would be a huge advantage to only go through that mess once.
However, for KTable there is no analogue of TimeWindowedKStream. The graphic below is taken from the developer guide and illustrates this point rather explicitly: KGroupedTable aggregations are
Always non-windowed
The fact that this is so explicitly not allowed makes me believe there is something I am overlooking, but I don't see a conceptual problem. For a windowed aggregation I need to maintain a WindowStore with all events that fall into a window. Couldn't this just as well be fed from a KTable with an upsert instead of an insert? Is there a reason why windowing on a KTable cannot work?
And finally: How can I work around this? I see two ways:
Map Windowed<MyKey> to a new Windowed<MyKey> and replace the startMs by
windowed.window().startTime().atZone(GERMAN_TIME_ZONE_ID).truncatedTo(ChronoUnit.DAYS).toInstant().toEpochMilli()
In this case, I need to write a custom transformer to properly implement the grace period, and I also have to use a `KeyValueStore` rather than a `WindowStore`. In essence, I need to do manually everything `windowedBy` would do under the hood. I already implemented this and it passes all integration tests, but it feels like it's not supposed to be done this way.
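For illustration, the re-keying and re-aggregation part of this approach would look roughly like the sketch below (hourlyRevenueTable, DailyKey, dailyKeySerde, and the combine/remove methods are placeholders; the grace-period transformer is omitted):

// Rough sketch of option (a): re-key the hourly table onto a "day" key and run a
// plain (non-windowed) KGroupedTable aggregation. DailyKey, dailyKeySerde and the
// combine()/remove() methods are hypothetical; grace-period handling is omitted.
// Note that a KTable aggregation needs a subtractor, which is awkward for min/max.
KTable<DailyKey, BigDecimalSummaryStatistics> dailyRevenue = hourlyRevenueTable
        .groupBy((windowedKey, stats) -> KeyValue.pair(
                        new DailyKey(
                                windowedKey.key(),
                                windowedKey.window().startTime()
                                        .atZone(GERMAN_TIME_ZONE_ID)
                                        .truncatedTo(ChronoUnit.DAYS)
                                        .toInstant()
                                        .toEpochMilli()),
                        stats),
                Grouped.with(dailyKeySerde, revenueSerde))
        .aggregate(
                BigDecimalSummaryStatistics::new,
                (key, newStats, agg) -> agg.combine(newStats),   // adder
                (key, oldStats, agg) -> agg.remove(oldStats),    // subtractor (required for upserts)
                Materialized.with(dailyKeySerde, revenueSerde));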
Convert the KTable back to a KStream and use a windowed aggregation again. In this case I would need to keep the state of the KStream somewhere, maybe in a HashMap inside the aggregate function, to implement the semantics of an update stream. I tried this before. It was slow, buggy, and extremely hard to maintain, but it worked. I find this solution terrible because the only reason KTable exists is so that I don't have to do something like this.
Otherwise I could just bite the bullet and do the aggregation on the raw data directly. I described the drawbacks above already.
To summarize again:
Why is there no analogue of TimeWindowedKStream for KTable?
How can I do a windowed aggregation of a KTable with the current API?
a) Map Windowed<MyKey> to a new timescale and treat this as a normal aggregation
b) Convert the KTable to a KStream and do the state management manually
c) Something else
d) Please don't
I would appreciate all feedback and ideas!

Related

Best practices for transforming a batch of records using KStream

I am new to KStreams and would like to know best practices or guidance on how to optimally process batches of n records using KStreams. I have working code as shown below, but it only handles single messages at a time.
KStream<String, String> sourceStream = builder.stream("upstream-kafka-topic",
        Consumed.with(Serdes.String(), Serdes.String()));
// transform sourceStream using an implementation of ValueTransformer<String, String>
sourceStream.transformValues(() -> new MyValueTransformer())
        .to("downstream-kafka-topic",
                Produced.with(Serdes.String(), Serdes.String()));
The code above works for single records because MyValueTransformer, which implements ValueTransformer, transforms a single String value. How do I make the above code work for a collection of String values?
You would need to somehow "buffer / aggregate" the messages. For example, you could add a state store to your transformer and store N messages inside the store. As long as the store contains fewer than N messages, you don't do any processing and don't emit any output (you might want to use flatTransformValues, which allows you to emit zero or more results).
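A very rough sketch of that idea (the store name "batch-buffer", N, and the restart handling are placeholders; this is not a tested implementation):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.kafka.streams.kstream.ValueTransformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Sketch: buffer records in a state store and only emit once N of them have arrived.
// Store name, N, and the (omitted) recovery of the counter after a restart are placeholders.
public class BatchingTransformer implements ValueTransformer<String, Iterable<String>> {

    private static final int N = 100;

    private KeyValueStore<Long, String> buffer;
    private long nextIndex = 0L;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        buffer = (KeyValueStore<Long, String>) context.getStateStore("batch-buffer");
        // on restart, nextIndex should be re-derived from the store contents (omitted here)
    }

    @Override
    public Iterable<String> transform(String value) {
        buffer.put(nextIndex++, value);
        if (nextIndex < N) {
            return Collections.emptyList();   // fewer than N buffered: emit nothing
        }
        List<String> batch = new ArrayList<>();
        try (KeyValueIterator<Long, String> it = buffer.all()) {
            it.forEachRemaining(kv -> batch.add(kv.value));
        }
        for (long i = 0; i < nextIndex; i++) {  // clear the buffer
            buffer.delete(i);
        }
        nextIndex = 0L;
        return batch;                           // hand the whole batch downstream
    }

    @Override
    public void close() { }
}

The store would be created via Stores.keyValueStoreBuilder, registered with StreamsBuilder#addStateStore, and wired in with something like sourceStream.flatTransformValues(BatchingTransformer::new, "batch-buffer").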
Not sure what you're trying to achieve. Kafka Streams is by design built to process one record at a time. If you want to process a collection or batch of messages, you have a couple of options.
You might not actually need Kafka Streams, as the example you mentioned doesn't do much with the message; in that case you can use a plain consumer, which enables you to process in batches. Check the Spring Kafka implementation of this here -> https://docs.spring.io/spring-kafka/docs/current/reference/html/#receiving-messages (Kafka fetches batches on the network layer, but normally you process one record at a time; with a standard client it is possible to process batches). Alternatively, you could model your value object to contain an array of messages, so that each record you receive carries an embedded collection, which you could then process with Kafka Streams; check the Avro array type -> https://avro.apache.org/docs/current/spec.html#Arrays
Check this part of the documentation to better understand the Kafka Streams concepts -> https://kafka.apache.org/31/documentation/streams/core-concepts
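A minimal sketch of the plain-consumer option (broker address, group id, topic name, and processBatch are placeholders):

import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Sketch: a plain KafkaConsumer treats each poll() result as one batch.
// Broker address, group id, topic name and processBatch() are placeholders.
public class BatchConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "batch-processor");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("upstream-kafka-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) {
                    continue;
                }
                List<String> batch = new ArrayList<>();
                records.forEach(r -> batch.add(r.value()));
                processBatch(batch);    // placeholder for whatever handles the whole batch
                consumer.commitSync();  // commit only after the batch was processed
            }
        }
    }

    private static void processBatch(List<String> batch) {
        // placeholder: do the collection-level processing here
    }
}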

Apache Flink relating/caching data options

This is a very broad question, I’m new to Flink and looking into the possibility of using it as a replacement for a current analytics engine.
The scenario is: data is collected from various equipment and received as a JSON-encoded string with the format {“location.attribute”:value, “TimeStamp”:value}.
For example, a unitary traceability code is received for a location, after which various process parameters are received in a real-time stream. The analysis is to be run over the process parameters; however, the output needs to include a relation to the traceability code. For example {“location.alarm”:value, “location.traceability”:value, “TimeStamp”:value}
What method does Flink use for caching values, in this case the current traceability code whilst running analysis over other parameters received at a later time?
I'm mainly just looking for the area to research, as so far I've been unable to find any examples of this kind of scenario. Perhaps it's not the kind of process that Flink can handle.
A natural way to do this sort of thing with Flink would be to key the stream by the location, and then use keyed state in a ProcessFunction (or RichFlatMapFunction) to store the partial results until ready to emit the output.
With a keyed stream, you are guaranteed that every event with the same key will be processed by the same instance. You can then use keyed state, which is effectively a sharded key/value store, to store per-key information.
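A minimal sketch of that pattern, assuming the stream is keyed by location (Event, EnrichedEvent, and their accessors are assumptions about your data model):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Sketch: keyed by location, the latest traceability code is kept in keyed state
// and attached to every later parameter/alarm event for that location.
// Event / EnrichedEvent and their accessors are placeholders for your data model.
public class TraceabilityEnricher extends KeyedProcessFunction<String, Event, EnrichedEvent> {

    private transient ValueState<String> traceability;

    @Override
    public void open(Configuration parameters) {
        traceability = getRuntimeContext().getState(
                new ValueStateDescriptor<>("traceability", String.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<EnrichedEvent> out) throws Exception {
        if (event.isTraceabilityCode()) {
            traceability.update(event.getValue());                        // remember the current code
        } else {
            out.collect(new EnrichedEvent(event, traceability.value()));  // enrich with the cached code
        }
    }
}

It would be wired in with something like stream.keyBy(Event::getLocation).process(new TraceabilityEnricher()).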
The Apache Flink training includes some explanatory material on keyed streams and working with keyed state, as well as an exercise or two that explore how to use these mechanisms to do roughly what you need.
Alternatively, you could do this with the Table or SQL API, and implement this as a join of the stream with itself.

Can I join the KTable produced by a Kafka Streams #aggregate call before the aggregation runs?

I have a number of IOT devices that report events via messages to a kafka topic, and I have defined an aggregator to update the device state from those events.
What I'd like to do is be able to join the input stream to the KTable that the aggregator outputs before the aggregation updates the state-- that is, I want to, say, compare an event to the current state, and if they match a certain predicate, do some processing, and then update the state.
I've tried creating the state store with StreamsBuilder#addStateStore first, but that method returns a StreamsBuilder, and doesn't seem to provide me a way to turn it into a KTable.
I've tried joining the input stream against the KTable produced by StreamsBuilder#aggregate, but that doesn't do what I want, because it only gives me the value in the KTable after the aggregation has run, and I'd like it to run before the aggregation.
// this is fine, but it returns a StreamsBuilder and I don't see how to get a KTable out of it
streamsBuilder.addStateStore(
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore(deviceStateAggregator),
                Serdes.String(),
                Serdes.String()
        )
);
// this doesn't work because I only get doThingsBeforeStateUpdate called after the state is updated by the DeviceStateAggregator
KTable<String, DeviceState> deviceTable = deviceEventKStream
        .groupByKey(Serialized.with(Serdes.String(), new deviceEventSerde()))
        .aggregate(
                () -> null,
                new DeviceStateAggregator(),
                Materialized.<String, DeviceState>as(stateStoreSupplier)
                        .withValueSerde(deviceStateSerde)
        );
deviceEventKStream.join(deviceTable, (event, state) -> doThingsBeforeStateUpdate(event, state));
I was hoping to be able to exploit the Streams DSL to check some preconditions before the state is updated by the aggregator, but it doesn't seem possible. I'm currently exploring the idea of using a Processor, or perhaps just extending my DeviceStateAggregator to do all the pre-aggregation processing as well, but that feels awkward to me, as it forces the aggregation to care about concerns that don't seem reasonable to do as part of the aggregation.
that is, I want to, say, compare an event to the current state, and if they match a certain predicate, do some processing, and then update the state.
If I understand your question and notably this quote correctly, then I'd follow your idea to use the Processor API to implement this. You will need to implement a Transformer (as you want it to output data, not just read it).
As an example application that you could use as a starting point, I'd recommend looking at the MixAndMatch DSL + Processor API and the CustomStreamTableJoin examples at https://github.com/confluentinc/kafka-streams-examples. The second example shows, though for a different use case, how to do custom "if this then that" logic when working with state in the Processor API, and it also covers join functionality, which is something you want to do, too.
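For illustration, the core of such a Transformer could look roughly like this (the store name, the DeviceEvent/DeviceState types, and the matchesPredicate / doThingsBeforeStateUpdate / aggregate methods are placeholders, not the examples' actual code):

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

// Sketch only: a Transformer that looks at the *current* device state before it is
// updated. Store name, types and the helper methods are placeholders.
public class PreAggregationTransformer
        implements Transformer<String, DeviceEvent, KeyValue<String, DeviceState>> {

    private KeyValueStore<String, DeviceState> stateStore;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        stateStore = (KeyValueStore<String, DeviceState>) context.getStateStore("device-state-store");
    }

    @Override
    public KeyValue<String, DeviceState> transform(String deviceId, DeviceEvent event) {
        DeviceState current = stateStore.get(deviceId);
        if (matchesPredicate(event, current)) {          // compare the event to the current state
            doThingsBeforeStateUpdate(event, current);   // ...and react before the update
        }
        DeviceState updated = aggregate(current, event); // then apply the aggregation logic
        stateStore.put(deviceId, updated);
        return KeyValue.pair(deviceId, updated);
    }

    @Override
    public void close() { }
}

The store itself would be created with Stores.keyValueStoreBuilder, registered via StreamsBuilder#addStateStore, and connected with deviceEventKStream.transform(PreAggregationTransformer::new, "device-state-store").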
Hope this helps!

Using my own Cassandra driver to write aggregations results

I'm trying to create a simple application that writes the page views of each web page on my site to Cassandra. Every 5 minutes, I want to write the accumulated page views since the start of the logical hour.
My code for this looks something like this:
KTable<Windowed<String>, Long> hourlyPageViewsCounts = keyedPageViews
        .groupByKey()
        .count(TimeWindows.of(TimeUnit.MINUTES.toMillis(60)), "HourlyPageViewsAgg");
I also set my commit interval to 5 minutes via the COMMIT_INTERVAL_MS_CONFIG property. To my understanding, that should aggregate over a full hour and output the intermediate accumulated state every 5 minutes.
My questions now are two:
Given that I have my own Cassandra driver, how do I write the 5-minute intermediate results of the aggregation to Cassandra? I tried to use foreach, but that doesn't seem to work.
I need a write only after 5 minutes of aggregation, not on each update. Is that possible? Reading here suggests it might not be without using the low-level API, which I'm trying to avoid, as this seems like a simple enough task to accomplish with the higher-level APIs.
Committing and producing/writing output are two different concepts in the Kafka Streams API. In Kafka Streams, output is produced continuously, and commits are used to "mark progress" (i.e., to commit consumer offsets, including flushing all stores and buffered producer records).
You might want to check out this blog post for more details: https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
1) To write to Cassandra, it is recommended to write the result of your application back into a topic (via #to("topic-name")) and use Kafka Connect to get the data into Cassandra.
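With a current Kafka Streams version that could look roughly like the sketch below (the output topic name is a placeholder, and the windowed key is flattened into a string so a Connect sink can handle it):

// Sketch only: flatten the windowed key and write the counts to an output topic
// that a Kafka Connect Cassandra sink can then consume. Topic name is a placeholder.
hourlyPageViewsCounts
        .toStream()
        .map((windowedKey, count) -> KeyValue.pair(
                windowedKey.key() + "@" + windowedKey.window().start(), count))
        .to("hourly-page-views-counts", Produced.with(Serdes.String(), Serdes.Long()));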
Compare: External system queries during Kafka Stream processing
2) Using the low-level API is the only way to go (as you already pointed out) if you want strict 5-minute intervals. Note that the next release (Kafka 1.0) will include wall-clock-time punctuations, which should make it easier for you to achieve your goal.
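For reference, such a wall-clock punctuation could look roughly like the sketch below once Kafka 1.0 is out (store name, wiring, and the actual Cassandra write are placeholders):

import java.util.concurrent.TimeUnit;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Sketch of a Kafka 1.0+ wall-clock punctuation: keep the latest counts in a store
// and forward them every 5 minutes. Store name and the downstream write (e.g. the
// Cassandra call) are placeholders.
public class FiveMinuteEmitter implements Processor<String, Long> {

    private ProcessorContext context;
    private KeyValueStore<String, Long> counts;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.counts = (KeyValueStore<String, Long>) context.getStateStore("page-view-counts");
        context.schedule(TimeUnit.MINUTES.toMillis(5), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            try (KeyValueIterator<String, Long> it = counts.all()) {
                it.forEachRemaining(kv -> context.forward(kv.key, kv.value)); // or call Cassandra here
            }
        });
    }

    @Override
    public void process(String key, Long count) {
        counts.put(key, count);   // remember the latest count per key
    }

    public void punctuate(long timestamp) { }   // deprecated stream-time punctuation, unused

    @Override
    public void close() { }
}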

#Storm: how to setup various metrics for the same data source

I'm trying to set up Storm to aggregate a stream, but with various (DRPC-available) metrics on the same stream.
E.g. the stream consists of messages that have a sender, a recipient, the channel through which the message arrived, and a gateway through which it was delivered. I'm having trouble deciding how to organize one or more topologies that could give me e.g. the total count of messages by gateway and/or by channel. Besides the totals, counts per minute would be nice too.
The basic idea is to have a spout that will accept messaging events, and from there aggregate the data as needed. Currently I'm playing around with Trident and DRPC and I've come up with two possible topologies that solve the problem at this stage. I can't decide which approach is better, if either.
The entire source is available at this gist.
It has three classes:
RandomMessageSpout
used to emit the messaging data
simulates the real data source
SeparateTopology
creates a separate DRPC stream for each metric needed
also a separate query state is created for each metric
they all use the same spout instance
CombinedTopology
creates a single DRPC stream with all the metrics needed
creates a separate query state for each metric
each query state extracts the desired metric and groups results for it
Now, for the problems and questions:
SeparateTopology
is it necessary to use the same spout instance or can I just say new RandomMessageSpout() each time?
I like the idea that I don't need to persist grouped data by all the metrics, but just the groupings we need to extract later
is the spout-emitted data actually processed by all the state/query combinations, and not just by the first one that gets it?
would this also later enable dynamic addition of new state/query combinations at runtime?
CombinedTopology
I don't really like the idea that I need to persist data grouped by all the metrics since I don't need all the combinations
it came as a surprise that all the metrics always return the same data
e.g. channel and gateway inquiries return status metrics data
I found that this was always the data grouped by the first field in state definition
this topic explains the reasoning behind this behaviour
but I'm wondering if this is a good way of doing things in the first place (and will find a way around this issue if need be)
SnapshotGet vs TupleCollectionGet in stateQuery
with SnapshotGet things tended to work, but not always; only TupleCollectionGet solved the issue
any pointers as to what is the correct way of doing this?
I guess this is a longish question / topic, but any help is really appreciated!
Also, if I missed the architecture entirely, suggestions on how to accomplish this would be most welcome.
Thanks in advance :-)
You can't actually split a stream in SeparateTopology by invoking newStream() using the same spout instance, since that would create new instances of the same RandomMessageSpout spout, which would result in duplicate values being emitted to your topology by multiple, separate spout instances. (Spout parallelization is only possible in Storm with partitioned spouts, where each spout instance processes a partition of the whole dataset -- a Kafka partition, for example).
The correct approach here is to modify the CombinedTopology to split the stream into multiple streams as needed for each metric you need (see below), and then do a groupBy() by that metric's field and persistentAggregate() on each newly branched stream.
From the Trident FAQ,
"each" returns a Stream object, which you can store in a variable. You can then run multiple eaches on the same Stream to split it, e.g.:
Stream s = topology.each(...).groupBy(...).aggregate(...)
Stream branch1 = s.each(...)
Stream branch2 = s.each(...)
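Applied to this question, the branching could then look roughly like this (assuming topology is the TridentTopology; field names and the in-memory state factory are placeholders, and the DRPC query streams are omitted):

// Sketch: one stream from the spout, branched per metric via groupBy +
// persistentAggregate. Field names and MemoryMapState are placeholders; in
// production you would plug in a proper state backend and the DRPC queries.
Stream messages = topology.newStream("messages", new RandomMessageSpout());

TridentState countsByGateway = messages
        .groupBy(new Fields("gateway"))
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

TridentState countsByChannel = messages
        .groupBy(new Fields("channel"))
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));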
See this thread on Storm's mailing list, and this one for more information.
