retention.ms not set on changelog topics created by WindowStoreBuilder - apache-kafka-streams

Kafka Streams automatically sets retention.ms and cleanup.policy on internal topics, e.g. those backing materialized KTables. However, I observed that retention.ms isn't set on the logging (changelog) topics created by WindowStoreBuilder.
Intuitively, I'd say that logging topics are also internal topics and could or should be auto-configured. In my case the store has a very short retention of a few minutes, but the default retention.ms is 5 days. Restoring a local store takes one hour, even though we only care about the last few minutes of data and older data will be evicted from the store anyway.
Am I doing something wrong, or should I explicitly set those config entries?
var retention = Duration.ofMinutes( 5 );
var storeBuilder =
    new WindowStoreBuilder<>(
            Stores.persistentWindowStore(
                "name",
                retention,
                Duration.ofMinutes( 1 ),
                false ),
            Serdes.String( ), new JaegerSpanSerde( ), Time.SYSTEM )
        .withLoggingEnabled( Map.of(
            TopicConfig.RETENTION_MS_CONFIG, Long.toString( retention.toMillis( ) ),
            TopicConfig.SEGMENT_MS_CONFIG, Long.toString( retention.toMillis( ) ) ) );
If the retention is increased in a subsequent release and the changelog topic already exists, will the topic configuration be updated?

Internal topics are only configured by Kafka Streams when they are created. If you change the retention time setting in your code, the corresponding topic config won't be updated. This is a known issue: https://issues.apache.org/jira/browse/KAFKA-7591
As a workaround, you can manually reconfigure the changelog topic.
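A minimal sketch of what the manual reconfiguration could look like, assuming the usual Kafka Streams naming convention `<application.id>-<store name>-changelog` for changelog topics and a Kafka version whose `kafka-configs.sh` accepts `--bootstrap-server` for topic configs. It only assembles the command line; the application id `my-app`, the address `localhost:9092`, and the class and method names are hypothetical, while the store name and the 5-minute retention come from the question.

```java
import java.time.Duration;
import java.util.List;

final class ChangelogRetentionCommand {

    // Kafka Streams names a store's changelog topic "<application.id>-<storeName>-changelog".
    static String changelogTopic(String applicationId, String storeName) {
        return applicationId + "-" + storeName + "-changelog";
    }

    // Assembles the arguments for the kafka-configs.sh tool that ships with Kafka.
    static List<String> alterRetentionCommand(String bootstrapServers, String topic, Duration retention) {
        return List.of(
            "bin/kafka-configs.sh",
            "--bootstrap-server", bootstrapServers,
            "--alter",
            "--entity-type", "topics",
            "--entity-name", topic,
            "--add-config", "retention.ms=" + retention.toMillis());
    }

    public static void main(String[] args) {
        // "my-app" is a hypothetical application.id; "name" is the store name from the question.
        String topic = changelogTopic("my-app", "name");
        System.out.println(String.join(" ",
            alterRetentionCommand("localhost:9092", topic, Duration.ofMinutes(5))));
    }
}
```

The printed command can be run once against the existing topic; newly created topics still take their settings from the map passed to withLoggingEnabled.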

Related

Why Kafka streams creates topics for aggregation and joins

I recently created my first Kafka Streams application for learning. I used spring-cloud-stream-kafka-binding. This is a simple eCommerce system in which I read a topic called products, which receives a product entry whenever new stock of a product comes in. I aggregate the quantity to get the total quantity of a product.
I had two choices:
Send the aggregate details (KTable) to another Kafka topic called aggregated-products
Materialize the aggregated data
I chose the second option and found that the application created a Kafka topic by itself; when I consumed messages from that topic, I got the aggregated messages.
.peek((k, v) -> LOGGER.info("Received product with key [{}] and value [{}]", k, v))
.groupByKey()
.aggregate(Product::new,
    (key, value, aggregate) -> aggregate.process(value),
    Materialized.<String, Product, KeyValueStore<Bytes, byte[]>>as(PRODUCT_AGGREGATE_STATE_STORE)
        .withValueSerde(productEventSerde) //.withKeySerde(keySerde)
    // because keySerde is configured in application.properties
);
Using InteractiveQueryService, I am able to access this state store in my application to find out the total quantity available for a product.
Now I have a few questions:
Why did the application create a new Kafka topic?
If the answer is 'to store aggregated data', then how is this different from option 1, in which I could have sent the aggregated data myself?
Where does RocksDB come into the picture?
Code of my application (which does more than what I explained here) can be accessed from this link -
https://github.com/prashantbhardwaj/kafka-stream-example/blob/master/src/main/java/com/appcloid/kafka/stream/example/config/SpringStreamBinderTopologyBuilderConfig.java
The internal topics are called changelog topics and are used for fault tolerance. The state of the aggregation is stored both locally on disk using RocksDB and on the Kafka broker in the form of a changelog topic, which is essentially a "backup". If a task is moved to a new machine or the local state is lost for some other reason, Kafka Streams can restore the local state by reading all changes to the original state from the changelog topic and applying them to a new RocksDB instance. After restoration has finished (the whole changelog topic was processed), the same state should be on the new machine, and the new machine can continue processing where the old one stopped. There are a lot of intricate details to this (e.g. in the default setting, it can happen that the state is updated twice for the same input record when failures happen).
See also https://developer.confluent.io/learn-kafka/kafka-streams/stateful-fault-tolerance/
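The restore step described above can be pictured with a small plain-Java sketch. It is not Kafka Streams code: a HashMap stands in for RocksDB, a list stands in for the changelog topic, and the class and method names are hypothetical. Every write goes to both, and a lost store is rebuilt by replaying the list in order.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ChangelogRestoreSketch {

    // Every update is applied locally and appended to the changelog.
    static final class LoggedStore {
        final Map<String, Long> local = new HashMap<>();                    // plays the role of RocksDB
        final List<Map.Entry<String, Long>> changelog = new ArrayList<>();  // plays the role of the topic

        void put(String key, Long value) {
            local.put(key, value);
            changelog.add(new SimpleEntry<String, Long>(key, value));
        }

        void delete(String key) {
            local.remove(key);
            // A null value marks a delete (a "tombstone").
            changelog.add(new SimpleEntry<String, Long>(key, null));
        }
    }

    // Rebuilds a store from scratch by applying every change in order.
    static Map<String, Long> restore(List<Map.Entry<String, Long>> changelog) {
        Map<String, Long> rebuilt = new HashMap<>();
        for (Map.Entry<String, Long> change : changelog) {
            if (change.getValue() == null) {
                rebuilt.remove(change.getKey());
            } else {
                rebuilt.put(change.getKey(), change.getValue());
            }
        }
        return rebuilt;
    }

    public static void main(String[] args) {
        LoggedStore store = new LoggedStore();
        store.put("apple", 5L);
        store.put("pear", 2L);
        store.put("apple", 8L); // the running total for "apple" is updated
        store.delete("pear");

        Map<String, Long> rebuilt = restore(store.changelog);
        System.out.println(rebuilt.equals(store.local)); // prints "true"
    }
}
```

This also hints at the difference from option 1 in the question: the changelog exists so that the store can be rebuilt, not as an output for downstream consumers.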

KStream topology with in-memory state store: data not committed

I need to aggregate client information and push it to an output topic every hour.
I have a topology with:
input-topic
processor
sink topic
Data arrives in input-topic with a string key that contains a clientID concatenated with the date in YYYYMMDDHH format.
In my processor I use a simple InMemoryKeyValueStore (withCachingDisabled) to merge/aggregate data according to specific rules (depending on the business logic, data is sometimes not aggregated).
In a punctuator, every hour the program iterates over the state store to get all the messages, transforms them, and forwards them to the sink topic, after which I remove all processed messages from the state store.
After the punctuation, I check the size of the store, which is effectively empty (according to both .all() and approximateNumEntries), so everything looks OK.
But when I restart the application, the state store is restored with all the elements that should have been deleted.
When I manually read the changelog topic of the state store in Kafka (with a simple KafkaConsumer), I see two records for each key:
The first record is committed and the message contains my aggregation.
The second record is a deletion message (a message with a null value) but is not committed (visible only with read_uncommitted), which is dangerous in my case because the next punctuator will forward the aggregate again.
I have experimented with committing in the punctuator that forwards, and I have created another punctuator that commits the context periodically (every 3 seconds), but after the restart I still have my data restored in the store (as expected, since my delete message is not committed).
I have a classic KStream configuration:
acks=all
enable.idempotence=true
processing.guarantee=exactly_once_v2
commit.interval.ms=100
isolation.level=read_committed
with the latest version of the kafka-streams library (3.2.2) and a cluster on 2.6.
Any help getting my records in the state store committed is welcome. I don't use TimeWindowedKStream because it does not exactly match my need (sometimes I don't aggregate but forward directly).

Is there any option of cold-bootstrapping a persistent store in Kafka Streams?

I have been working with Kafka Streams for a couple of months. We are using RocksDB to store data. The changelog topic keeps only a few days of data, while our application's persistent stores hold a few months of data. How will the store state be restored if a partition is moved from one node to another (which, I think, happens through the changelog)?
Also, suppose the node containing the active task goes down and a new node is introduced. The replica will be promoted to active and a new replica will start building on the new node. If the changelog has only a few days of data, the new replica will have only that data instead of the original few months.
So, is there any option to transfer data to a replica from the active store rather than from the changelog (as it only has a fraction of the data)?
Changelog topics that are used to back up stores don't have a retention time but are configured with log compaction enabled (cf. https://kafka.apache.org/documentation/#compaction). Thus, it's guaranteed that no data is lost, no matter how long you run. The changelog topic will always contain the exact same data as your RocksDB stores.
Thus, for fail-over or scale-out, when a task migrates and a store needs to be rebuilt, it will be a complete copy of the original store.
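To illustrate why compaction does not lose state, here is a simplified plain-Java sketch (not broker code; the class and method names are hypothetical). It keeps only the latest value per key, which is what a compacted topic converges to. As a simplification, keys whose latest record is a null value are dropped immediately, whereas a real broker keeps such tombstones for a while before removing them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CompactionSketch {

    // Each log record is a {key, value} pair; a null value marks a delete.
    static Map<String, String> compact(List<String[]> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String[] record : log) {
            String key = record[0];
            String value = record[1];
            latest.remove(key);          // an older record for this key is superseded
            if (value != null) {
                latest.put(key, value);  // only the newest value per key survives
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        List<String[]> log = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            log.add(new String[] {"key-" + (i % 3), "value-" + i});
        }
        Map<String, String> compacted = compact(log);
        System.out.println(compacted.size());       // prints 3
        System.out.println(compacted.get("key-0")); // prints value-999
    }
}
```

The compacted log is bounded by the number of distinct keys rather than by the age of the records, so a store rebuilt from it holds the current value of every key regardless of when that key was last written.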

Kafka Stream work with JoinWindow for data replay

I have 2 streams of data and I want to be able to join them over a window of, let's say, 1 month. With live data everything is fun and super easy with KStream and join. I did something like this:
KStream<String, GenericRecord> stream1 =
    builder.stream(Serdes.String(), new CustomizeAvroSerde<>(this.getSchemaRegistryClient(), this.getKafkaPropsMap()), getKafkaConsumerTopic1());
KStream<String, GenericRecord> stream2 =
    builder.stream(Serdes.String(), new CustomizeAvroSerde<>(this.getSchemaRegistryClient(), this.getKafkaPropsMap()), getKafkaConsumerTopic2());
long joinWindowSizeMs = 30L * 24L * 60L * 60L * 1000L; // 30 days
KStream<String, GenericRecord> joinStream = stream1.join(stream2,
    new ValueJoiner<GenericRecord, GenericRecord, GenericRecord>() {
        @Override
        public GenericRecord apply(GenericRecord genericRecord, GenericRecord genericRecord2) {
            final GenericRecord jonnedRecord = new GenericData.Record(jonnedRecordSchema);
            ....
            ....
            ....
            return jonnedRecord;
        }
    }, JoinWindows.of(joinWindowSizeMs));
The problem appears when I want to do a data replay. Let's say I want to redo these joins for the data I have from the past 6 months. Since I am running the pipeline for all the data at once, Kafka Streams will join all the joinable data without taking the time difference into consideration (it should only join records that are within one month of each other). I am assuming the JoinWindow time is the time at which we insert data into the Kafka topic, am I right?
And how can I change and manipulate this time so that I can run my data replay correctly? I mean that when re-inserting these past 6 months of data, it should apply a window of one month to each respective record and join based on that.
This question is not a duplicate of "How to manage Kafka KStream to Kstream windowed join?"; there I asked how I can join based on a window of time. Here I am talking about data replay. From my understanding, during a join Kafka takes the time at which the data was inserted into the topic as the time for the JoinWindow, so if you want to do a data replay and re-insert data from 6 months ago, Kafka treats it as new data inserted today and will join it with other data that is actually from today, which it shouldn't.
Kafka's Streams API uses timestamps returned by TimestampExtractor to compute joins. By default, this is the record's embedded metadata timestamp. (cf. http://docs.confluent.io/current/streams/concepts.html#time)
By default, KafkaProducer sets this timestamp to the current system time on write. (As an alternative, you can configure brokers on a per-topic basis to overwrite producer-provided timestamps of records with the broker's system time at the time the broker stored the record; this provides "ingestion time" semantics.)
Thus, it is not a Kafka Streams issue per se.
There are multiple options to tackle the problem:
If your data is already in a topic, you can simply reset your Streams application to reprocess old data. For this, you can use the application reset tool (bin/kafka-streams-application-reset.sh). You also need to set the auto.offset.reset policy to earliest in your Streams app. Check out the docs; it's also recommended to read the blog post.
http://docs.confluent.io/current/streams/developer-guide.html#application-reset-tool
https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
This is the best approach, as you do not need to write data to the topic again.
If your data is not in a topic and you need to write the data, you can set the record timestamp explicitly at the application level, by providing a timestamp for each record:
KafkaProducer<K, V> producer = new KafkaProducer<>(...);
// Constructor used: ProducerRecord(String topic, Integer partition, Long timestamp, K key, V value)
producer.send(new ProducerRecord<>(topic, partition, timestamp, key, value));
Thus, if you ingest old data you can set the timestamp explicitly and Kafka Streams will pick it up and compute the join accordingly.
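A small plain-Java sketch of why the timestamp matters for the join. It is not Kafka Streams code: it only models the window check, assuming that two records with the same key are join candidates when their timestamps differ by at most the window size. The class name, method name, and dates are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

final class JoinWindowSketch {

    // True when the two record timestamps are at most one window size apart.
    static boolean withinWindow(long leftTimestampMs, long rightTimestampMs, Duration window) {
        return Math.abs(leftTimestampMs - rightTimestampMs) <= window.toMillis();
    }

    public static void main(String[] args) {
        Duration window = Duration.ofDays(30);

        // Original event times of three historical records.
        long jan10 = Instant.parse("2017-01-10T00:00:00Z").toEpochMilli();
        long jan25 = Instant.parse("2017-01-25T00:00:00Z").toEpochMilli();
        long jun01 = Instant.parse("2017-06-01T00:00:00Z").toEpochMilli();

        // With explicit event timestamps, only records close in event time are candidates.
        System.out.println(withinWindow(jan10, jan25, window)); // prints "true"
        System.out.println(withinWindow(jan10, jun01, window)); // prints "false"

        // If the replay writes both records "now", the producer's default timestamps
        // are seconds apart, so even the January and June records look joinable.
        long replayTime = Instant.parse("2017-07-01T12:00:00Z").toEpochMilli();
        System.out.println(withinWindow(replayTime, replayTime + 1_000L, window)); // prints "true"
    }
}
```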

logstash kafka input performance / config tuning

I use logstash to transfer data from Kafka to Elasticsearch and I'm getting the following error:
WARN org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Auto offset commit failed for group kafka-es-sink: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
I tried to adjust the session timeout (to 30000) and max poll records (to 250).
The topic produces 1000 events per second in Avro format. There are 10 partitions (2 servers) and two Logstash instances with 5 consumer threads each.
I have no problems with other topics with ~100-300 events per second.
I think it must be a config issue, because I also have a second connector between Kafka and Elasticsearch on the same topic that works fine (Confluent's kafka-connect-elasticsearch).
The main aim is to compare Kafka Connect and Logstash as connectors. Does anyone have some general experience with this?

Resources