Kafka streams window batching - apache-kafka-streams

Coming from a Spark Streaming background - getting a grasp on Kafka streams.
I have a simple Spark Streaming application that reads from Kafka,
and returns the latest event per user in that minute
Sample events would look like {"user": 1, "timestamp": "2018-05-18T16:56:30.754Z", "count": 3}, {"user": 1, "timestamp": "2018-05-22T16:56:39.754Z", "count": 4}
I'm interested in how this would work in Kafka Streams, as it seems that there is an output for each event - when my use case is to reduce the traffic.
From my reading so far it seems that this is not straight forward, and you would have to use the processor api.
Ideally I would like to use the DSL instead of the processor API, as I am just starting to look at Kafka streams, but it seems that I would have to use the processor API's punctuate method to read from a state store every n seconds?
I'm using kafka 0.11.0

At DSL level, Kafka Streams allows to configure KTable caches (enabled by default) that reduce downstream load. The cache is an LRU cache that is flushed regularly. Thus, while the cache reduces downstream load, it does not guarantee how many outputs per window you get. (cf. https://docs.confluent.io/current/streams/developer-guide/memory-mgmt.html)
If you strictly need a single output per window, using the Processor API is the right way to go.

I haven't tried it myself yet, but Kafka Streams now supports suppress operation. Take a look here:
You can use Suppress to effectively rate-limit just one KTable output
or callback. Or, especially valuable for non-retractable outputs like
alerts, you can use it to get only the final results of a windowed
aggregation. The clearest use case for Suppress at the moment is
getting final results
Based on the article, the code can look like:
events
.groupByKey()
.windowedBy(
TimeWindows.of(Duration.ofMinutes(2).withGrace(Duration.ofMinutes(2))
)
.count(Materialized.as("count-metric"))
.suppress(Suppressed.untilWindowClose(BufferConfig.unbounded()))
.filter( _ < 4 )
.toStream()
.foreach( /* Send that email! */)
I'm using Kafka Streams 2.6.0 and I'm able to reuse the same approach to build a stream.

Related

How to count metrics from executions of AWS lambdas?

I have all sorts of metrics I would like to count and later query. For example I have a lambda that processes stuff from a queue, and for each batch I would like to save a count like this:
{
"processes_count": 6,
"timestamp": 1695422215,
"count_by_type": {
"type_a": 4,
"type_b": 2
}
}
I would like to dump these pieces somewhere and later have the ability to query how many were processed within a time range.
So these are the options I considered:
write the json to the logs, and later have a component (beats?) that processed these logs and send to a timeseries db.
in the end of each execution send it directly to a timeseries db (like elasticearch).
What is better in terms of cost / scalability? Are there more options I should consider?
I think Cloud Watch Embedded Metric Format (EMF) would be good here. There are client libraries for Node.js, Python, Java, and C#.
CW EMF allows you to push metrics out of Lambda into CloudWatch in a managed async way. So it's a cost-effective and low-effort way of producing metrics.
The client library produces a particular JSON format to stdout, when CW sees a message of this type it automatically creates the metrics for you from it.
You can also include key-value pairs in the EMF format which allows you to go back and query the data with these keys in the future.
High-level clients are available with Lambda Powertools in Python and Java.

Best practices for transforming a batch of records using KStream

I am new to KStream and would like to know best practices or guidance on how to optimally process batch of records of n size using KStream. I have a working code as shown below but it does work for single messages at a time.
KStream<String, String> sourceStream = builder.stream("upstream-kafka-topic",
Consumed.with(Serdes.String(),
Serders.String());
//transform sourceStream using implementation of ValueTransformer<String,String>
sourceStream.transformValues(() -> new MyValueTransformer()).
to("downstream-kafka-topic",
Produced.with(Serdes.String(),
Serdes.String());
Above code works with single records as MyValueTransformer which implements ValueTransformer transforms single String value. How do I make above code work for Collection of String values?
You would need to somehow "buffer / aggregate" the messages. For example, you could add a state store to your transformer and store N messages inside the store. As long as the store contains fewer than N messages you don't do any processing and also don't emit any output (you might want to use flatTransformValues which allows you to emit zero results).
Not sure what you're trying to achieve. Kafka Streams by concept is designed to process one record at a time. If you want to process a collection or batch of messages you have a few options.
You might not actually need Kafka streams as the example you mentioned doesn't do much with the message, in this case, you can leverage a normal Consumer which will enable you to process in Batches. Check spring Kafka implementation of this here -> https://docs.spring.io/spring-kafka/docs/current/reference/html/#receiving-messages (Kafka process batches on the network layer but normally you would process one record at a time, but it's possible with a standard client to process batches) OR you might model your Object value to have an array of messages so for each record you will be receiving an object which contains a collection embedded which you could then use Kafka streams to do it, check the array type for Avro -> https://avro.apache.org/docs/current/spec.html#Arrays
Check this part of the documentation to understand better the Kafka streams concepts -> https://kafka.apache.org/31/documentation/streams/core-concepts

Storm bolt following a kafka bolt

I have a Storm topology where I have to send output to kafka as well as update a value in redis. For this I have a Kafkabolt as well as a RedisBolt.
Below is what my topology looks like -
tp.setSpout("kafkaSpout", kafkaSpout, 3);
tp.setBolt("EvaluatorBolt", evaluatorBolt, 6).shuffleGrouping("kafkaStream");
tp.setBolt("ResultToRedisBolt",ResultsToRedisBolt,3).shuffleGrouping("EvaluatorBolt","ResultStream");
tp.setBolt("ResultToKafkaBolt", ResultsToKafkaBolt, 3).shuffleGrouping("EvaluatorBolt","ResultStream");
The problem is that both of the end bolts (Redis and Kafka) are listening to the same stream from the preceding bolt (ResultStream), hence both can fail independently. What I really need is that if the result is successfully published in Kafka, then only I update the value in Redis. Is there a way to have an output stream from a kafkaBolt where I can get the messages published successfully to Kafka? I can then probably listen to that stream in my RedisBolt and act accordingly.
It is not currently possible, unless you modify the bolt code. You would likely be better off changing your design slightly, since doing extra processing after the tuple is written to Kafka has some drawbacks. If you write the tuple to Kafka and you fail to write to Redis, you will get duplicates in Kafka, since the processing will start over at the spout.
It might be better, depending on your use case, to write the result to Kafka, and then have another topology read the result from Kafka and write to Redis.
If you still need to be able to emit new tuples from the bolt, it should be pretty easy to implement. The bolt recently got the ability to add a custom Producer callback, so we could extend that mechanism.
See the discussion at https://github.com/apache/storm/pull/2790#issuecomment-411709331 for context.

Using my own Cassandra driver to write aggregations results

I'm trying to create a simple application which writes to Cassandra the page views of each web page on my site. I want to write every 5 minutes the accumulative page views from the start of a logical hour.
My code for this looks something like this:
KTable<Windowed<String>, Long> hourlyPageViewsCounts = keyedPageViews
.groupByKey()
.count(TimeWindows.of(TimeUnit.MINUTES.toMillis(60)), "HourlyPageViewsAgg")
Where I also set my commit interval to 5 minutes by setting the COMMIT_INTERVAL_MS_CONFIG property. To my understanding that should aggregate on full hour and output intermediate accumulation state every 5 minutes.
My questions now are two:
Given that I have my own Cassandra driver, how do I write the 5 min intermediate results of the aggregation to Cassandra? Tried to use foreach but that doesn't seem to work.
I need a write only after 5 min of aggregation, not on each update. Is it possible? Reading here suggests it might not without using low-level API, which I'm trying to avoid as it seems like a simple enough task to be accomplished with the higher level APIs.
Committing and producing/writing output is two different concepts in Kafka Streams API. In Kafka Streams API, output is produced continuously and commits are used to "mark progress" (ie, to commit consumer offsets including the flushing of all stores and buffered producer records).
You might want to check out this blog post for more details: https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
1) To write to Casandra, it is recommended to write the result of you application back into a topic (via #to("topic-name")) and use Kafka Connect to get the data into Casandra.
Compare: External system queries during Kafka Stream processing
2) Using low-level API is the only way to go (as you pointed out already) if you want to have strict 5-minutes intervals. Note, that next release (Kafka 1.0) will include wall-clock-time punctuations which should make it easier for you to achieve your goal.

#Storm: how to setup various metrics for the same data source

I'm trying to setup Storm to aggregate a stream, but with various (DRPC available) metrics on the same stream.
E.g. the stream is consisted of messages that have a sender, a recipient, the channel through which the message arrived and a gateway through which it was delivered. I'm having trouble deciding how to organize one or more topologies that could give me e.g. total count of messages by gateway and/or by channel. And besides the total, counts per minute would be nice too.
The basic idea is to have a spout that will accept messaging events, and from there aggregate the data as needed. Currently I'm playing around with Trident and DRPC and I've came up with two possible topologies that solve the problem at this stage. Can't decide which approach is better, if any?!
The entire source is available at this gist.
It has three classes:
RandomMessageSpout
used to emit the messaging data
simulates the real data source
SeparateTopology
creates a separate DRPC stream for each metric needed
also a separate query state is created for each metric
they all use the same spout instance
CombinedTopology
creates a single DRPC stream with all the metrics needed
creates a separate query state for each metric
each query state extracts the desired metric and groups results for it
Now, for the problems and questions:
SeparateTopology
is it necessary to use the same spout instance or can I just say new RandomMessageSpout() each time?
I like the idea that I don't need to persist grouped data by all the metrics, but just the groupings we need to extract later
is the spout emitted data actually processed by all the state/query combinations, e.g. not the first one that comes?
would this also later enable dynamic addition of new state/query combinations at runtime?
CombinedTopology
I don't really like the idea that I need to persist data grouped by all the metrics since I don't need all the combinations
it came as a surprise that the all the metrics always return the same data
e.g. channel and gateway inquiries return status metrics data
I found that this was always the data grouped by the first field in state definition
this topic explains the reasoning behind this behaviour
but I'm wondering if this is a good way of doing thins in the first place (and will find a way around this issue if need be)
SnapshotGet vs TupleCollectionGet in stateQuery
with SnapshotGet things tended to work, but not always, only TupleCollectionGet solved the issue
any pointers as to what is correct way of doing that?
I guess this is a longish question / topic, but any help is really appreciated!
Also, if I missed the architecture entirely, suggestions on how to accomplish this would be most welcome.
Thanks in advance :-)
You can't actually split a stream in SeparateTopology by invoking newStream() using the same spout instance, since that would create new instances of the same RandomMessageSpout spout, which would result in duplicate values being emitted to your topology by multiple, separate spout instances. (Spout parallelization is only possible in Storm with partitioned spouts, where each spout instance processes a partition of the whole dataset -- a Kafka partition, for example).
The correct approach here is to modify the CombinedTopology to split the stream into multiple streams as needed for each metric you need (see below), and then do a groupBy() by that metric's field and persistentAggregate() on each newly branched stream.
From the Trident FAQ,
"each" returns a Stream object, which you can store in a variable. You can then run multiple eaches on the same Stream to split it, e.g.:
Stream s = topology.each(...).groupBy(...).aggregate(...)
Stream branch1 = s.each(...)
Stream branch2 = s.each(...)
See this thread on Storm's mailing list, and this one for more information.

Resources