I could not find any indication of which notion of time is used for the commit interval on a KTable. Is it wall-clock time, stream time, or producer time?
If you are referring to the commit.interval.ms config, then the answer is wall-clock time.
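For illustration, this is where the setting lives (the application id and bootstrap servers below are just placeholders):
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");            // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
// offsets are committed (and caches flushed) every 30 seconds of wall-clock time,
// regardless of the record timestamps / stream time
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 30000);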
I need to aggregate client information and push it to an output topic every hour.
I have a topology with:
input-topic
processor
sink topic
Data arrives in input-topic with a string key that contains a clientID concatenated with a date in YYYYMMDDHH format.
In my processor I use a simple InMemoryKeyValueStore (withCachingDisabled) to merge/aggregate data according to specific rules (data is sometimes not aggregated, depending on the business logic).
In a punctuator, every hour the program iterates over the state store, transforms each message and forwards it to the sink topic, after which I delete all the processed messages from the state store.
After the punctuation, I check the size of the store and it is effectively empty (via .all() and approximateNumEntries); everything looks OK.
But when I restart the application, the state store is restored with all the elements that should have been deleted.
When I manually read the changelog topic of the state store in Kafka (with a simple KafkaConsumer), I see that there are two records for each key:
The first record is committed and its value contains my aggregation.
The second record is a delete (a message with a null value) but it is not committed (it is visible only with read_uncommitted), which is dangerous in my case because the next punctuation will forward the aggregate again.
I have played with calling commit in the punctuator that forwards, and I have created another punctuator that commits the context periodically (every 3 seconds), but after the restart my data is still restored in the store (which makes sense, since my delete messages are not committed).
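Roughly, the processor looks like the following sketch (the store name, types and aggregation logic are simplified placeholders, not my real code):
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class HourlyAggregateProcessor implements Processor<String, String, String, String> {

    private ProcessorContext<String, String> context;
    private KeyValueStore<String, String> store;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
        this.store = context.getStateStore("agg-store"); // placeholder store name
        // wall-clock punctuation once per hour
        context.schedule(Duration.ofHours(1), PunctuationType.WALL_CLOCK_TIME, this::flush);
    }

    @Override
    public void process(Record<String, String> record) {
        // merge/aggregate according to business rules (omitted)
        store.put(record.key(), record.value());
    }

    private void flush(long timestamp) {
        List<String> processed = new ArrayList<>();
        try (KeyValueIterator<String, String> it = store.all()) {
            while (it.hasNext()) {
                KeyValue<String, String> entry = it.next();
                context.forward(new Record<>(entry.key, entry.value, timestamp));
                processed.add(entry.key);
            }
        }
        // delete the forwarded entries; these deletes produce the tombstones
        // that remain uncommitted in the changelog after a restart
        processed.forEach(store::delete);
        context.commit(); // only requests a commit, it does not commit synchronously
    }
}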
I have a classic Kafka Streams configuration:
acks=all
enable.idempotence=true
processing.guarantee=exactly_once_v2
commit.interval.ms=100
isolation.level=read_committed
with the latest version of the kafka-streams library (3.2.2) and a cluster on 2.6.
Any help is welcome to get my state store records committed. I don't use TimeWindowedKStream because it does not exactly fit my need (sometimes I don't aggregate but forward directly).
I have 2 topics, source_topic.a and source_topic.b.
source_topic.a has a dependency on source_topic.b (e.g. source_topic.b needs to be sunk first). To keep the sink process correct, data from source_topic.b needs to be sunk first, and only then data from source_topic.a. Is there any way to set an order of topics/tables in the source/sink configurations?
The configurations used are below; there are multiple tables and topics. The timestamp mode is used to detect new or modified rows each time a table is polled, and timestamp.initial sets the starting point to a specific timestamp.
The Source Configuration
name=jdbc-mssql-prod-5
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:sqlserver:
connection.user=
connection.password=
topic.prefix=source_topic.
mode=timestamp
table.whitelist=A,B,C
timestamp.column.name=ModifiedDateTime
connection.backoff.ms=60000
connection.attempts=300
validate.non.null=false
# enter timestamp in milliseconds
timestamp.initial=1604977200000
The Sink Configuration
name=mysql-sink-prod-5
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=sink_topic_a, sink_topic_b
connection.url=jdbc:mysql:
connection.user=
connection.password=
insert.mode=upsert
delete.enabled=true
pk.mode=record_key
errors.log.enable=true
errors.log.include.messages=true
No, the JDBC Sink connector doesn't support that kind of logic.
You're applying batch thinking to a streams world :) Consider: how would Kafka know that it had "finished" sinking topic_a? Streams are unbounded, so you'd end up having to say something like "if you don't receive any more messages in a given time window then assume that you've finished sinking data from this topic and move onto the next one".
You may be best off doing the necessary join of the data within Kafka itself (e.g. with Kafka Streams or ksqlDB), and then writing the result back to a new Kafka topic which you then sink to your database.
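For example, a minimal Kafka Streams sketch of that idea, assuming the two topics share a key and default serdes are configured (the joined topic name and the join logic are placeholders):
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

// read the "parent" data as a table and the dependent data as a stream
KTable<String, String> b = builder.table("source_topic.b");
KStream<String, String> a = builder.stream("source_topic.a");

// combine each record from a with the matching record from b (placeholder logic)
a.join(b, (aValue, bValue) -> aValue + "|" + bValue)
 .to("joined_topic"); // sink this single topic with the JDBC sink connector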
Which partition assignment strategy does Kafka Streams use? Can we change the partition strategy in Kafka Streams the way we can for a normal Kafka consumer?
streamsConfiguration.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,Collections.singletonList(StickyAssignor.class));
makes no difference; the StreamsPartitionAssignor is always used.
No. You cannot set a custom partition assignor.
Kafka Streams has very specific requirements for how partition assignment must work, and if it is not done correctly, incorrect results could be computed. Thus, it is not allowed to set a custom partition assignor.
I have 2 streams of data and I want to be able to join them for a window of, let's say, 1 month. With live data everything is fun and super easy with KStream and join. I did something like this:
KStream<String, GenericRecord> stream1 =
builder.stream(Serdes.String(), new CustomizeAvroSerde<>(this.getSchemaRegistryClient(), this.getKafkaPropsMap()), getKafkaConsumerTopic1());
KStream<String, GenericRecord> stream2 =
builder.stream(Serdes.String(), new CustomizeAvroSerde<>(this.getSchemaRegistryClient(), this.getKafkaPropsMap()), getKafkaConsumerTopic2());
long joinWindowSizeMs = 30L * 24L * 60L * 60L * 1000L; // 30 days
KStream<String, GenericRecord> joinStream = stream1.join(stream2,
new ValueJoiner<GenericRecord, GenericRecord, GenericRecord>() {
@Override
public GenericRecord apply(GenericRecord genericRecord, GenericRecord genericRecord2) {
final GenericRecord jonnedRecord = new GenericData.Record(jonnedRecordSchema);
....
....
....
return jonnedRecord;
}
}, JoinWindows.of(joinWindowSizeMs));
The problem appears when I want to do a data replay. Let's say I want to redo this join for the data I have from the past 6 months. Since I am running the pipeline for all the data at once, Kafka Streams joins all the joinable data and does not take the time difference into consideration (it should only join data within one month of each other). I am assuming the JoinWindow time is the time we insert the data into the Kafka topic, am I right?
And how can I change and manipulate this time so I can run my data replay correctly? I mean, when re-inserting these past 6 months of data, it should use a window of one month for each respective record and join based on that.
This question is not a duplicate of How to manage Kafka KStream to Kstream windowed join?; there I asked about how I can join based on a window of time, while here I am talking about data replay. From my understanding, during a join Kafka takes the time the data was inserted into the topic as the time for the JoinWindow, so if you do a data replay and re-insert data from 6 months ago, Kafka treats it as new data inserted today and will join it with other data that is actually from today, which it shouldn't.
Kafka's Streams API uses timestamps returned by TimestampExtractor to compute joins. By default, this is the record's embedded metadata timestamp. (c.f. http://docs.confluent.io/current/streams/concepts.html#time)
By default, KafkaProducer sets this timestamp to the current system time on write. (As an alternative, you can configure brokers on a per-topic basis to overwrite the producer-provided timestamps with the broker's system time at the time the broker stores the record -- this provides "ingestion time" semantics.)
Thus, it is not a Kafka Streams issue per se.
There are multiple options to tackle the problem:
If your data is already in a topic, you can simply reset your Streams application to reprocess the old data. For this, you can use the application reset tool (bin/kafka-streams-application-reset.sh); an example invocation is sketched after the links below. You also need to set the auto.offset.reset policy to earliest in your Streams app. Check out the docs -- it's also recommended to read the blog post.
http://docs.confluent.io/current/streams/developer-guide.html#application-reset-tool
https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
This is the best approach, as you do not need to write data to the topic again.
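For example, an invocation could look roughly like this (the application id and topic names are placeholders, and exact flag names vary slightly between Kafka versions):
bin/kafka-streams-application-reset.sh --application-id my-streams-app \
    --bootstrap-servers localhost:9092 \
    --input-topics topic1,topic2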
If your data is not in a topic and you need to write the data, you can set the record timestamp explicitly at the application level, by providing a timestamp for each record:
KafkaProducer<K, V> producer = new KafkaProducer<>(...);
// use the ProducerRecord constructor that takes an explicit timestamp:
// ProducerRecord(String topic, Integer partition, Long timestamp, K key, V value)
producer.send(new ProducerRecord<>(topic, partition, timestamp, key, value));
Thus, if you ingest old data you can set the timestamp explicitly and Kafka Streams will pick it up and compute the join accordingly.
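If the replayed records carry their event time inside the payload, a related option is to plug in a custom TimestampExtractor so that Streams uses that embedded time instead of the produce time. A rough sketch, where the field name "eventTime" is a placeholder:
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Hypothetical extractor that reads the event time from a field in the Avro payload.
public class PayloadTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        GenericRecord value = (GenericRecord) record.value();
        Object eventTime = value.get("eventTime"); // placeholder field name
        // fall back to the partition time if the field is missing
        return eventTime instanceof Long ? (Long) eventTime : partitionTime;
    }
}
It would be registered via the timestamp extractor config of the Streams application (default.timestamp.extractor in current versions).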
I use Logstash to transfer data from Kafka to Elasticsearch, and I'm getting the following error:
WARN org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Auto offset commit failed for group kafka-es-sink: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
I tried to adjust the session timeout (to 30000) and max poll records (to 250).
The topic produces 1000 events per second in Avro format. There are 10 partitions (2 servers) and two Logstash instances with 5 consumer threads each.
I have no problems with other topics with ~100-300 events per second.
I think it must be a config issue, because I also have a second connector between Kafka and Elasticsearch on the same topic which works fine (Confluent's kafka-connect-elasticsearch).
The main aim is to compare Kafka Connect and Logstash as connectors. Does anyone have experience with this in general?
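For reference, the knobs I mentioned above map to these Kafka consumer settings (a sketch with the values from my tests, not a recommendation):
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

Properties props = new Properties();
props.put(ConsumerConfig.GROUP_ID_CONFIG, "kafka-es-sink");
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000"); // more time before the group is rebalanced
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "250");     // smaller batches returned by each poll()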