Can Kafka Streams sub-tasks delete from a state store simultaneously? - apache-kafka-streams

I have a question regarding the Kafka Streams state store methods all() and delete().
My topology is simple: it has one source topic, one processor node, and a sink topic. I need to maintain the state of the stream, so I'm using a KeyValue-based state store.
The processor node saves the state to the KeyValue store (RocksDB) in process(), and a punctuator scheduled at a certain interval then performs all() on the state store, runs some complex business logic, and once done deletes the retrieved entries/keys from the state store.
Now, my question is: if there are multiple sub-tasks (stream threads) running this logic of doing all() on the state store and then delete(), is it possible that each of these different sub-tasks (threads) ends up getting all the keys and then each of them attempts the delete?
In other words, does KeyValueIterator<K, V> all() return all the keys across all partitions of the state store, or only those in the shard dedicated to that sub-task (thread)?
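For reference, here is a minimal sketch of what my processor looks like (the store name, types, and schedule interval are just placeholders, and registering the store with the topology is not shown):

import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class MyProcessor implements Processor<String, String, String, String> {

    private ProcessorContext<String, String> context;
    private KeyValueStore<String, String> store;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
        this.store = context.getStateStore("my-state-store"); // placeholder store name
        // punctuator scheduled at a certain interval
        context.schedule(Duration.ofMinutes(5), PunctuationType.WALL_CLOCK_TIME, this::punctuate);
    }

    @Override
    public void process(Record<String, String> record) {
        // save the state in the KeyValue store (RocksDB)
        store.put(record.key(), record.value());
    }

    private void punctuate(long timestamp) {
        // read everything from the state store, run the business logic, then delete
        try (KeyValueIterator<String, String> iterator = store.all()) {
            while (iterator.hasNext()) {
                KeyValue<String, String> entry = iterator.next();
                // ... complex business logic ...
                context.forward(new Record<>(entry.key, entry.value, timestamp));
                store.delete(entry.key);
            }
        }
    }
}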
Thanks,

Related

How to deal with concurrent events in an event-driven architecture

Suppose I have an eCommerce application designed in an event-driven architecture. I would publish events like ProductCreated and ProductPriceUpdated. Typically both events are published on separate channels.
Now a consumer of those events comes into play and would react on these, for example to generate a price-chart for specific products.
In fact this consumer has the requirement to consume the ProductCreated event first, to create a Product entity with the necessary information in its own bounded context. Only once a product has been created can price points be added to the chart. Depending on the consumer's performance it can easily happen that those events arrive "out of order".
What are the possible strategies to fulfill this requirement?
The following came to my mind:
Publish both events onto the same channel with ordering guarantees. For example, in Kafka both events would be published to the same partition. However, this would mean that a single topic/partition would grow with all of these events, I would have to deal with different schemas, and the documentation would grow.
Use documents over events. Simply publishing every state change of the product entity as a single ProductUpdated event or similar. This way I would lose semantics from the message and need to figure out what exactly changed on consumer-side.
Defer event consumption. So if my consumer would consume a ProductPriceUpdated event and I don't have such a product created yet, I postpone the consumption by storing it in a database and come back at a later point or use retry-topics in Kafka terms.
Create a minimal entity. Once I receive a ProductPriceUpdated event I would probably have a correlation id or something to identify the entity, so I could simply create an entity with just this id and fill in the missing information once the ProductCreated event arrives.
Just thought of giving you some inline comments, based on my understanding of your requirements (#1, #3 and #4).
Publish both events onto the same channel with ordering guarantees. For example, in Kafka both events would be published to the same partition. However, this would mean that a single topic/partition would grow with all of these events, I would have to deal with different schemas, and the documentation would grow.
[Chris]: Apache Kafka preserves the order of messages within a partition. But the mapping of keys to partitions is consistent only as long as the number of partitions in a topic does not change. So as long as the number of partitions is constant, you can be sure the order is guaranteed. When partitioning by key is important, the easiest solution is to create topics with sufficient partitions and never add partitions; see the sketch below.
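As an illustration, here is a minimal producer sketch that keys every event by the product id, so both ProductCreated and ProductPriceUpdated for one product hash to the same partition (topic name, product id, and string payloads are made up):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProductEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String productId = "product-42"; // made-up id, used as the record key
            // Same key => same partition (with the default partitioner and a fixed partition count),
            // so a consumer sees ProductCreated before ProductPriceUpdated.
            producer.send(new ProducerRecord<>("product-events", productId, "ProductCreated"));
            producer.send(new ProducerRecord<>("product-events", productId, "ProductPriceUpdated"));
        }
    }
}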
Defer event consumption. So if my consumer would consume a ProductPriceUpdated event and I don't have such a product created yet, I postpone the consumption by storing it in a database and come back at a later point or use retry-topics in Kafka terms.
[Chris]: If latency is not of a concern, and if we are okay with an additional operation overhead of adding a new entity into your solution, such as a storage layer, this pattern looks fine.
Create a minimal entity. Once I receive a ProductPriceUpdated event I would probably have a correlation id or something to identify the entity, so I could simply create an entity with just this id and fill in the missing information once the ProductCreated event arrives.
[Chris]: This is a fairly common integration pattern (messaging layer -> backend REST API) that works over a unique identifier, in this case a correlation id.
This can be easily achieved if you have separate topics and a consumer per event type and the order of messages from the producer is guaranteed. Thus, option #1 becomes obsolete.
From my perspective, options #3 and #4 look much the same, and #4 would be ideal.
On another note, if you are thinking of bringing Kafka Streams/KTable into your solution, just go for it, as there is a strong relationship between streams and tables called duality.
The duality of streams and tables allows your application to support more elastic, fault-tolerant stateful processing and to run interactive queries. And KSQL adds more flavour to it, because this use case is essentially data enrichment at the integration layer; a sketch of such an enrichment follows.
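A rough sketch of what that enrichment could look like with Kafka Streams (topic names and the plain String values are assumptions; in practice you would configure proper serdes):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class PriceChartTopology {

    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Latest state of each product, keyed by product id (assumed topic name).
        KTable<String, String> products = builder.table("product-created");

        // Stream of price updates, keyed by the same product id (assumed topic name).
        KStream<String, String> priceUpdates = builder.stream("product-price-updated");

        // Stream-table join: each price update is enriched with the product it belongs to.
        // Price updates for products that do not exist yet simply do not join (inner join).
        priceUpdates
                .join(products, (price, product) -> product + " -> " + price)
                .to("product-price-chart"); // assumed output topic

        return builder;
    }
}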

Spring boot kafka: Microservice multi instances, concurrency and partitions

I have a question about the way of publishing and reading messages in Kafka for microservice architectures with multiple instances of the same microservice for writing and reading.
My main problem here is that the microservices that publish and read are configured with autoscaling, but with a default number of instances of 1.
The point is that I have an entity, let's call it "Event", that is stored in the database, and each entity has its own ID there. When some specific command is executed on a specific entity (let's say the one with entityID = ajsha87), a message must be published that will be read by a consumer. If the messages for the same entity are written to different partitions and consumed at the same time (a concurrency issue), I will have a lot of problems.
My question is whether, based on the entityID for example, I can set the partition to which all events for this specific entity will be published. For another entity with a different ID I don't care about the partition, but the messages for the same entity must always be published to the same partition, to avoid a consumer reading a message (2), published after a message (1), before that first message.
Is there any mechanism to do that, or do I have to randomly pick and store in the database, each time I save the entity, the partition ID to which its messages will be published?
The same happens with consumers. Only one consumer should read a partition at a time, because otherwise consumer number 1 can read message (1) from partition (1) related to entity (ID=78198), and then another consumer can read message (2) from partition (1) related to the same entity and process message 2 before number one.
Is there any mechanism to subscribe each instance to only one partition, taking the microservice autoscaling into account?
Another option would be to assign a partition dynamically to each new publisher instance, but I don't know how to configure that dynamically to set different partition IDs according to the microservice instance.
I am using Spring Boot, by the way.
Thanks for your answers and recommendations, and sorry if my English is not good enough.
If you use a hash partitioner as the partitioner in the producer config (this is the default partitioner in many libraries) and use the same key for the same entity (let's say the one with entityID = ajsha87), Kafka makes sure all messages with the same key go to the same partition.
If you are using a consumer group, one consumer instance takes responsibility for a partition, and all messages published to that partition are consumed by that instance only. The instance can change if there is a rebalance when scaling up, but the messages in one partition are still read by a single consumer instance. A sketch follows.
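For example, with Spring Kafka something like this would be enough on the producer side (the topic and class names are just illustrative); the default partitioner hashes the key, so all messages for one entity end up in one partition:

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class EventPublisher {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public EventPublisher(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void publish(String entityId, String payload) {
        // The entity id is the record key: same key -> same partition -> ordered per entity.
        kafkaTemplate.send("entity-events", entityId, payload); // "entity-events" is an assumed topic name
    }
}

On the consumer side, giving all instances the same group id makes Kafka assign each partition to exactly one instance at a time, so per-entity ordering is preserved without any manual partition bookkeeping.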

A SpringBatch job to produce events for a PubSub preserving source order

I'm considering creating a Spring Batch job that uses rows from a table to create events and pushes those events to a Pub/Sub implementation. The problem here is that the order of the events should be the same as the order of the rows in the table used as the source of the event creation process.
It seems to me now that Spring Batch is not designed for such order preservation, as batches are processed and then written in parallel. The only ugly but probably working solution to this problem would be to do all the processing in the reader (so the reader would do reading + processing + writing to Pub/Sub); that could help keep the order inside paginated batches, but even that doesn't seem to guarantee the order of the batches, according to the docs.
Any ideas how the "ordered rows -> ordered events" transition could be implemented using Spring Batch or, at least, Spring Boot? Thank you in advance!
It seems to me now that Spring Batch is not designed for such order preservation, as batches are processed and then written in parallel.
This is true only for a multi-threaded or a partitioned step. The default (single-threaded) chunk-oriented step implementation processes items in the same order returned by the item reader. So if you make your database reader return items in the order you want, those items will be written to your Pub/Sub broker in the same order.
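For illustration, a minimal sketch of such a single-threaded step, assuming a Spring Batch 4.x style configuration (the config class, table, column, and the pubSubWriter bean are made up):

import javax.sql.DataSource;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class OrderedPublishJobConfig {

    @Bean
    public JdbcCursorItemReader<String> orderedReader(DataSource dataSource) {
        // The ORDER BY clause defines the event order; the cursor reader streams rows in that order.
        return new JdbcCursorItemReaderBuilder<String>()
                .name("orderedReader")
                .dataSource(dataSource)
                .sql("SELECT payload FROM source_table ORDER BY id ASC")
                .rowMapper((rs, rowNum) -> rs.getString("payload"))
                .build();
    }

    @Bean
    public Step publishStep(StepBuilderFactory steps,
                            JdbcCursorItemReader<String> orderedReader,
                            ItemWriter<String> pubSubWriter) {
        // Default single-threaded chunk-oriented step: items are written
        // chunk by chunk in exactly the order the reader returned them.
        return steps.get("publishStep")
                .<String, String>chunk(100)
                .reader(orderedReader)
                .writer(pubSubWriter)
                .build();
    }
}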

Kafka state-store on different scaled instances

I have 5 different machines, each running 5 scaled Spring Boot instances of a Kafka Streams application. I am using 2-3 different compacted topics with 50 partitions, and each of my instances has a concurrency of 10. I am using Docker Swarm and Docker volumes. Using these topics, my Kafka Streams app does some flatMap, map, and join operations on KTables and KStreams.
props.put(StreamsConfig.STATE_DIR_CONFIG, "/tmp/kafka-streams");
props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 2);
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 100);
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
props.put("num.stream.threads", 10);
props.put("application.id", applicationId);
If everything goes OK, nothing is wrong and there is no data loss in my application with the .join() operations, but when one of my instances goes down my join operations are no longer able to do the join.
My question is: when the app is restarted or redeployed (and given that it's working inside a non-persistent container), its state is gone, right? Then my join operations don't work. When I redeploy my instance and repopulate my compacted topic from Elasticsearch with the latest entities, my join operations are OK again. So I think that when my application starts on a new machine, my local state store is gone? But the Kafka documentation says:
If tasks run on a machine that fails and are restarted on another machine, Kafka Streams guarantees to restore their associated state stores to the content before the failure by replaying the corresponding changelog topics prior to resuming the processing on the newly started tasks. As a result, failure handling is completely transparent to the end user.
Note that the cost of task (re)initialization typically depends primarily on the time for restoring the state by replaying the state stores' associated changelog topics. To minimize this restoration time, users can configure their applications to have standby replicas of local states (i.e. fully replicated copies of the state). When a task migration happens, Kafka Streams then attempts to assign a task to an application instance where such a standby replica already exists in order to minimize the task (re)initialization cost. See num.standby.replicas at the Kafka Streams Configs Section.
(https://kafka.apache.org/0102/documentation/streams/architecture)
Does my downed instance refresh the Kafka state store when it comes back up? If so, why am I losing data? I have no idea :/ Or can it not reload the state store because of the committed offsets, since all my instances use the same applicationId?
Thanks !
The changelog topics are always read from the earliest offset, and they're compacted, so they don't lose data.
If you're joining non-compacted topics, then sure, you lose data, but that's not limited to Kafka Streams or your specific use case... You'll need to configure the topic to retain data for at least as long as you think it'll take you to solve any issues with topic downtime. While the data is retained, you can always seek your consumer back to it.
If you want persistent storage, use a volume mount to your container via Kubernetes, for example, or plug in a state store stored externally to the container, like Redis: https://github.com/andreas-schroeder/redisks
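As a small sketch of the volume-mount option, assuming the container mounts a persistent volume at /var/lib/kafka-streams (the path and application id are assumptions):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app"); // placeholder application id
// Point the local RocksDB state at the mounted volume so it survives container restarts;
// restoration then only needs to catch up from the changelog instead of replaying it fully.
props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");
// Standby replicas keep warm copies of the state on other instances to speed up failover.
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 2);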

State store partitioned iterator?

I have a Kafka Streams transformer which functions like a windower: it accumulates state into a state store in transform() and then forwards it to an output topic during punctuate(), with the state store partitioned by the same key as the input topic.
During punctuate(), I would like each StreamThread to only iterate its own partition of the state store to minimize the amount of data to be read from the backing kafka topic. But the only iterator I can get is through
org.apache.kafka.streams.state.ReadOnlyKeyValueStore<K,V>.all()
which iterates through the whole state store.
Is there any way to "assign partitions" of a state store and make punctuate() iterate only on the assigned partitions?
I guess ReadOnlyKeyValueStore<K,V>.all() does what you want. Note that the overall state is sharded into multiple stores, with one shard/store per partition. all() does not iterate through "other shards"; "all" means "everything local", i.e., everything from the shard of a single partition.

Resources