While reading about topology optimization, I stumbled upon the following:
Currently, there are two optimizations that Kafka Streams performs
when enabled:
1 - The source KTable re-uses the source topic as the changelog topic.
2 - When possible, Kafka Streams collapses multiple repartition topics
into a single repartition topic.
This question is about the first point. I do not fully understand what is happening under the hood here, and I want to make sure that I am not making any assumptions. Can someone explain what the state was before this optimization:
1 - Does the KTable use an internal changelog topic? If yes, can someone point me to a doc about that? Next, what is in that changelog topic? Is it the actual upsert log, composed of update operations?
2 - If my last guess is true, I do not understand how a changelog composed of upserts can be replaced by the source topic alone?
A changelog topic is a Kafka topic configured with log compaction. Each update to the KTable is written into the changelog topic. Because the topic is compacted, no data is ever lost, and re-reading the changelog topic allows Kafka Streams to re-create the local store.
The assumption behind this optimization is that the source topic is a compacted topic. In this case, the source topic and the corresponding changelog topic would contain exactly the same data. Thus, the optimization removes the changelog topic and uses the source topic to re-create the state store during recovery.
If your input topic is not compacted but applies a retention time, you might not want to enable the optimization as this could result in data loss.
About the history: Initially, Kafka Streams had this optimization hardcoded (and thus "forced" users to only read compacted topics as KTables if potential data loss was not acceptable). However, in version 1.0 a regression bug was introduced (via https://issues.apache.org/jira/browse/KAFKA-3856: the new StreamsBuilder behaved differently from the old KStreamBuilder, and StreamsBuilder would always create a changelog topic), "removing" the optimization. In version 2.0 the issue was fixed and the optimization is available again (cf. https://issues.apache.org/jira/browse/KAFKA-6874).
Note: the optimization is only available for source KTables. For KTables that are the result of a computation, such as an aggregation, the optimization is not available and a changelog topic will be created (unless explicitly disabled, which disables fault-tolerance for the corresponding store).
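For completeness, here is a minimal sketch of how the optimization is switched on, assuming Kafka Streams 2.x (where the config constant is StreamsConfig.TOPOLOGY_OPTIMIZATION; newer releases renamed it to TOPOLOGY_OPTIMIZATION_CONFIG). The topic name is a placeholder. Note that the properties also have to be passed to StreamsBuilder#build(Properties), otherwise the optimizer never sees them:

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KTable;

public class OptimizedSourceTable {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ktable-optimization-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // opt in to topology optimization; the default is "none"
        props.put(StreamsConfig.TOPOLOGY_OPTIMIZATION, StreamsConfig.OPTIMIZE);

        StreamsBuilder builder = new StreamsBuilder();
        // source KTable; with the optimization enabled, the (compacted) input topic
        // doubles as the changelog topic used to restore the state store
        KTable<String, String> table = builder.table("my-compacted-input-topic");

        // the properties must be passed to build() so the optimizer can rewrite the topology
        Topology topology = builder.build(props);
        KafkaStreams streams = new KafkaStreams(topology, props);
        streams.start();
    }
}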
Related
We use Kafka topics as both events and a repository. Using the kafka-streams API we define a simple K-Table that represents all the events in the topic.
In our use case we publish events to the topic and subsequently reference the K-Table as the backing repository. The main issue is that the published events are not immediately visible on the K-Table.
We tried transactions and exactly once semantics as described here (https://kafka.apache.org/26/documentation/streams/core-concepts#streams_processing_guarantee) but there is always a delay we cannot control.
1 - Publish event
2 - An undetermined amount of time passes
3 - The published event is visible in the K-Table
Is there a way to eliminate the delay, or otherwise to know that a specific event has been consumed by the K-Table?
NOTE: We tried both partition and global tables with similar results.
Thanks
Because Kafka is an asynchronous system, the observed delay is expected and you cannot do anything to avoid it.
However, if you publish a message to a topic, the KafkaProducer allows you to pass a Callback into the send() method, and the callback will be executed after the message has been written to the topic, providing the record's metadata: topic, partition, and offset.
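For illustration, a minimal sketch of that callback approach (topic name, key, and value are placeholders):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PublishWithCallback {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record = new ProducerRecord<>("events", "key", "value");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    // the write is acknowledged; the metadata tells you where the record landed
                    System.out.printf("written to %s-%d at offset %d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
        }
    }
}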
After Kafka Streams has processed messages, it will eventually commit the offsets (you can configure the commit interval, too). Thus, you know the message is in the KTable once its offset has been committed. By default, committing happens only every 30 seconds, and it's not recommended to use a very short commit interval because it implies large overhead. Thus, I am not sure if this would help in your case, as it seems you want a more timely "response".
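If the overhead is acceptable, the commit interval can be lowered via StreamsConfig; a small sketch (the one-second value is purely illustrative):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// commit every second instead of the 30-second default (trades overhead for timeliness)
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000);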
As an alternative, you can also disable caching on the KTable and use a toStream().process() step -- after each update to the KTable, the changelog stream provided by toStream() will contain the record, and you can access the record's metadata (including its offset) in the Processor via the given ProcessorContext object. This should also allow you to figure out when the record is available in the KTable.
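A sketch of that second approach, assuming String keys and values with matching default serdes, and the pre-3.0 Processor API (org.apache.kafka.streams.processor.Processor); the topic and store names are placeholders:

import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class TableUpdateObserver {
    public static void attach(StreamsBuilder builder) {
        // caching disabled so every single update is forwarded downstream immediately
        KTable<String, String> table = builder.table(
                "events",
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("events-store")
                        .withCachingDisabled());

        table.toStream().process(() -> new Processor<String, String>() {
            private ProcessorContext context;

            @Override
            public void init(ProcessorContext context) {
                this.context = context;
            }

            @Override
            public void process(String key, String value) {
                // called for every update that reached the KTable; the metadata identifies
                // the input record, so you can signal "this event is now visible"
                System.out.printf("update for key %s from %s-%d at offset %d%n",
                        key, context.topic(), context.partition(), context.offset());
            }

            @Override
            public void close() { }
        });
    }
}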
The following is from the Kafka documentation for version 2.1:
https://kafka.apache.org/documentation/
Offset expiration semantics has slightly changed in this version.
According to the new semantics, offsets of partitions in a group will
not be removed while the group is subscribed to the corresponding
topic and is still active (has active consumers). If group becomes
empty all its offsets will be removed after default offset retention
period (or the one set by broker) has passed (unless the group becomes
active again). Offsets associated with standalone (simple) consumers,
that do not use Kafka group management, will be removed after default
offset retention period (or the one set by broker) has passed since
their last commit.
If I understand this correctly, as long as the Stream Thread consumers are connected, no retention setting will be effective?
I also started to observe the following exception after restarting the streams application:
stream thread - Restoring Stream Tasks failed. Deleting StreamTasks stores to recreate from scratch.
org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: ...
even though the streams application sets the property StreamsConfig.consumerPrefix(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG) to "earliest".
I think it has something to do with retention, but I can't tell what.
If I understand this correctly, as long as the Stream Thread consumers are connected, no retention setting will be effective?
This applies to the __consumer_offsets topic only, which is a Kafka-internal topic. For all regular/user topics, retention time is applied the same way as in all previous versions. Also note that this only applies if you upgrade your brokers to 2.1.
For the log message from Streams: you don't need to worry about it. It seems that your application was offline for a longer time, and thus your local store is no longer in a consistent state. Therefore, it is deleted and recreated from scratch from the changelog topic.
I am stuck in a typical use case or scenario where I am not sure what the behavior of Kafka will be.
SCENARIO: I am using Spring Kafka with Spring Boot. In my application I have one REST endpoint which reads all messages from the beginning of a topic to check whether a message is a duplicate, and then writes it to the topic if it is not.
I am confused about what the behavior of the application will be when multiple instances of the same microservice are deployed and the offset is moved by the seekFromBeginning operation.
A few questions on my mind are:
Does reading from the beginning of a topic (with the help of seek) block the topic?
If yes, then how do we solve this typical use case where we have to validate a message for duplication before writing it to the topic?
Using a DB is not a solution because it would be resource-intensive and make the application slower.
Thanks everyone in Advance
It sounds like you need the Log Compaction feature:
Log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single topic partition.
Therefore, when you specify a unique message key, you won't have more than one record with that key in the partition. And with that, you don't need to read the topic before storing at all.
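If you create topics programmatically, a compacted topic can be set up with the Admin client; a sketch (topic name, partition count, and replication factor are placeholders):

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("messages", 3, (short) 1)
                    // compaction keeps the latest record per key instead of deleting by time
                    .configs(Collections.singletonMap(
                            TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}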
I'm still working on a Kafka Streams application that I described in
Why isn't Kafka consumer producing results?. In that posting, I asked why setting
kstreams_props.put( ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
doesn't appear to reset the state of Kafka to "start of the universe" before any data are pushed to any topic. I am now encountering a variant of that issue:
My application consists of a producer program that pushes data to a Kafka stream and a consumer program that groups the data, aggregates the groups, and then converts the resulting KTable back into a stream, which I print out.
The aggregation step is essentially adding up all the values, then putting those sums into the output stream as new data. What I observe, though, is that every time I run the program, the resulting aggregated values get bigger and bigger, almost as if Kafka is somehow retaining the previous results and including those in the aggregation.
In order to try fixing this, I deleted all my topics (except for __consumer_offsets, which Kafka would not allow), then re-ran my application, but the aggregated values continue to grow, as if Kafka were retaining the result of previous computations even though I thought that deleting the intermediate topics would fix things. I even tried stopping and restarting the Kafka server, to no avail.
What's going on here and, more to the point, how can I fix this? I've tried various suggestions about setting AUTO_OFFSET_RESET_CONFIG, also with no effect. I should mention that one aspect of my application is that my original producer creates its own Kafka timestamps in the Producer.send call, although disabling that also seemed to have no effect.
Thanks in advance, -- Mark
AUTO_OFFSET_RESET_CONFIG only triggers if there are no committed offsets: when an application starts, it first looks for committed offsets and applies the reset policy only if there are no valid offsets.
Furthermore, for a Kafka Streams application, resetting offsets would not be sufficient, and you should use the reset tool bin/kafka-streams-application-reset.sh -- this blog post explains the tool in detail: https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
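In addition to the reset tool (which handles committed offsets and internal topics on the brokers), the local state directory can be wiped from within the application via KafkaStreams#cleanUp(); a rough sketch, where builder and props stand for your existing topology and configuration:

// wipe the local state directory before reprocessing; must be called while the
// instance is not running (i.e. before start() or after close())
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.cleanUp();
streams.start();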
I'm trying to set up Storm to aggregate a stream, but with various (DRPC-available) metrics on the same stream.
E.g. the stream consists of messages that have a sender, a recipient, the channel through which the message arrived, and the gateway through which it was delivered. I'm having trouble deciding how to organize one or more topologies that could give me, e.g., the total count of messages by gateway and/or by channel. Besides the totals, counts per minute would be nice too.
The basic idea is to have a spout that will accept messaging events, and from there aggregate the data as needed. Currently I'm playing around with Trident and DRPC and I've come up with two possible topologies that solve the problem at this stage. I can't decide which approach is better, if either.
The entire source is available at this gist.
It has three classes:
RandomMessageSpout
- used to emit the messaging data
- simulates the real data source
SeparateTopology
- creates a separate DRPC stream for each metric needed
- also, a separate query state is created for each metric
- they all use the same spout instance
CombinedTopology
- creates a single DRPC stream with all the metrics needed
- creates a separate query state for each metric
- each query state extracts the desired metric and groups results for it
Now, for the problems and questions:
SeparateTopology
- is it necessary to use the same spout instance, or can I just say new RandomMessageSpout() each time?
- I like the idea that I don't need to persist data grouped by all the metrics, but just the groupings we need to extract later
- is the spout-emitted data actually processed by all the state/query combinations, i.e. not just the first one that comes?
- would this also later enable dynamic addition of new state/query combinations at runtime?
CombinedTopology
- I don't really like the idea that I need to persist data grouped by all the metrics, since I don't need all the combinations
- it came as a surprise that all the metrics always return the same data
- e.g. channel and gateway inquiries return status metrics data
- I found that this was always the data grouped by the first field in the state definition
- this topic explains the reasoning behind this behaviour
- but I'm wondering if this is a good way of doing things in the first place (and I will find a way around this issue if need be)
SnapshotGet vs TupleCollectionGet in stateQuery
- with SnapshotGet things tended to work, but not always; only TupleCollectionGet solved the issue
- any pointers as to what is the correct way of doing that?
I guess this is a longish question / topic, but any help is really appreciated!
Also, if I missed the architecture entirely, suggestions on how to accomplish this would be most welcome.
Thanks in advance :-)
You can't actually split a stream in SeparateTopology by invoking newStream() using the same spout instance, since that would create new instances of the same RandomMessageSpout spout, which would result in duplicate values being emitted to your topology by multiple, separate spout instances. (Spout parallelization is only possible in Storm with partitioned spouts, where each spout instance processes a partition of the whole dataset -- a Kafka partition, for example).
The correct approach here is to modify the CombinedTopology to split the stream into multiple streams as needed for each metric you need (see below), and then do a groupBy() by that metric's field and persistentAggregate() on each newly branched stream.
From the Trident FAQ,
"each" returns a Stream object, which you can store in a variable. You can then run multiple eaches on the same Stream to split it, e.g.:
Stream s = topology.each(...).groupBy(...).aggregate(...)
Stream branch1 = s.each(...)
Stream branch2 = s.each(...)
See this thread on Storm's mailing list, and this one for more information.
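To make the suggested layout concrete, here is a rough sketch (Storm 1.x package names; RandomMessageSpout is the spout from the gist and is assumed to emit "gateway" and "channel" fields; the DRPC function names are made up):

import org.apache.storm.LocalDRPC;
import org.apache.storm.generated.StormTopology;
import org.apache.storm.trident.Stream;
import org.apache.storm.trident.TridentState;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.operation.builtin.MapGet;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;

public class MetricsTopology {
    public static StormTopology build(LocalDRPC drpc) {
        TridentTopology topology = new TridentTopology();

        // single spout, single stream; branch the stream per metric instead of per spout
        Stream messages = topology.newStream("messages", new RandomMessageSpout());

        // one persistent aggregation per metric, each on its own branch
        TridentState byGateway = messages
                .groupBy(new Fields("gateway"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
        TridentState byChannel = messages
                .groupBy(new Fields("channel"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

        // one DRPC query stream per metric, each reading only its own state
        topology.newDRPCStream("count-by-gateway", drpc)
                .groupBy(new Fields("args"))
                .stateQuery(byGateway, new Fields("args"), new MapGet(), new Fields("count"));
        topology.newDRPCStream("count-by-channel", drpc)
                .groupBy(new Fields("args"))
                .stateQuery(byChannel, new Fields("args"), new MapGet(), new Fields("count"));

        return topology.build();
    }
}

Each DRPC call then returns the count for the key passed as the argument, and adding a new metric later only means branching messages again with another groupBy()/persistentAggregate() pair.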