How to be notified about updates to state store of GlobalKTable?

I am simply using the StreamsBuilder API to build a GlobalKTable like this:
Materialized<Long, Category, KeyValueStore<Bytes, byte[]>> materialized =
    Materialized.<Long, Category, KeyValueStore<Bytes, byte[]>>as(this.categoryStoreName)
        .withCachingDisabled()
        .withKeySerde(Serdes.Long())
        .withValueSerde(CATEGORY_JSON_SERDE);
return streamsBuilder.globalTable(categoryTopic, materialized);
I would like to be notified about changes to it. The table is rarely updated, and in case of an update I would like to trigger a cache invalidation. What is the Kafka way of doing this?

GlobalKTable does not support this. However, you can use a "global store" and implement your custom Processor that will be called for each update.
Internally, a GlobalKTable uses a "global store" and provides the Processor implementation for you.
You can add a global store via StreamsBuilder#addGlobalStore().
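For illustration, a minimal sketch of that approach (reusing the names from your snippet and the pre-3.0 Processor API; the invalidateCache() hook is a hypothetical stand-in for your own notification or cache-invalidation logic) could look like this:

StoreBuilder<KeyValueStore<Long, Category>> storeBuilder =
    Stores.keyValueStoreBuilder(
            Stores.persistentKeyValueStore(this.categoryStoreName),
            Serdes.Long(),
            CATEGORY_JSON_SERDE)
        .withLoggingDisabled(); // global stores must not have a changelog topic

streamsBuilder.addGlobalStore(
    storeBuilder,
    categoryTopic,
    Consumed.with(Serdes.Long(), CATEGORY_JSON_SERDE),
    () -> new Processor<Long, Category>() {
        private KeyValueStore<Long, Category> store;

        @Override
        @SuppressWarnings("unchecked")
        public void init(final ProcessorContext context) {
            store = (KeyValueStore<Long, Category>) context.getStateStore(categoryStoreName);
        }

        @Override
        public void process(final Long key, final Category value) {
            store.put(key, value);  // keep the store in sync, unmodified
            invalidateCache(key);   // hypothetical hook: your cache invalidation / notification
        }

        @Override
        public void close() {}
    });

Because process() is called for every record on the input topic, that is the place to trigger your notification.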

Related

Kafka Streams GlobalKTable and accessing the record headers

In the same way that KStream and KTable#toStream() allow calling process or transform and thus enable inspecting the record headers, is there a way to achieve the same with a GlobalKTable? Basically, I am looking for a way to inspect the record headers in the Kafka topic when consuming it as a GlobalKTable. Thank you!
Maybe, you could use #addGlobalStore instead?
Note though, that the "global processor" should never modify the data but put() the key-value pair (and maybe timestamp) unmodified into the (Timestamped)KeyValueStore (cf. https://issues.apache.org/jira/browse/KAFKA-8037).
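As a rough sketch (the store name and record types are placeholders, and it assumes the record headers are exposed via ProcessorContext#headers() inside the global processor), such a "global processor" could inspect the headers while still writing each record through unmodified:

new Processor<String, String>() {
    private ProcessorContext context;
    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        this.context = context;
        this.store = (KeyValueStore<String, String>) context.getStateStore("my-global-store");
    }

    @Override
    public void process(final String key, final String value) {
        final Headers headers = context.headers(); // inspect the headers of the current record
        // ... header-based side logic only ...
        store.put(key, value);                     // but always write the record through unmodified
    }

    @Override
    public void close() {}
}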

Event driven microservice - how to init old data?

I already have microservices running and would like to add events (Kafka).
For example, I have a customer service with 10000 customers in the db. I will be adding an event to the customer service so that whenever a new user is created, it publishes an event which will be consumed by consumers (like recommendation-service, statistics-service, etc.).
I think the above is clear to me. However, I am not sure how to handle the already-registered customers (the 10000 customers), as the event will only be triggered when a 'NEW' customer registers.
I could 'hack' the service to sync the data manually, but what do most people do in this case?
Thank you
I tried to search the topic but couldn't find what I am looking for.
There are basically two strategies that you can follow here. The first is a bulk load of fake "new customer" events into the Kafka topic, as you also suggested. The second approach would be to use the change data capture (CDC) pattern, where there is an initial snapshot of all the observed data followed by a constant stream of data change events, taken directly from the database's internal log (WAL).
To handle your entire use case, you could use a tool like the Debezium source connector for the Kafka Connect platform, but note that you will also need to map its change events into your message format. There are plugins to do that with a configuration-driven approach, but you can also create your custom logic using single message transformations (SMTs).
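As a sketch of the first strategy only (the topic name, Customer type, repository, and the JSON helper are all hypothetical), a one-off backfill job could look roughly like this:

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    for (Customer customer : customerRepository.findAll()) {           // hypothetical repository of existing customers
        producer.send(new ProducerRecord<>("customer-created-events",  // hypothetical topic name
                customer.getId(),
                toCustomerCreatedJson(customer)));                      // hypothetical event serializer helper
    }
    producer.flush();
}

Downstream consumers would then process these backfilled events exactly like regular "new customer" events.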

Can I join the KTable produced by a Kafka Streams #aggregate call before the aggregation runs?

I have a number of IoT devices that report events via messages to a Kafka topic, and I have defined an aggregator to update the device state from those events.
What I'd like to do is be able to join the input stream to the KTable that the aggregator outputs before the aggregation updates the state-- that is, I want to, say, compare an event to the current state, and if they match a certain predicate, do some processing, and then update the state.
I've tried creating the state store with StreamsBuilder#addStateStore first, but that method returns a StreamsBuilder, and doesn't seem to provide me a way to turn it into a KTable.
I've tried joining the input stream against the KTable produced by StreamsBuilder#aggregate, but that doesn't do what I want, because it only gives me the value in the KTable after the aggregation has run, and I'd like it to run before the aggregation.
// this is fine, but it returns a StreamsBuilder and I don't see how to get a KTable out of it
streamsBuilder.addStateStore(
    Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore(deviceStateAggregator),
        Serdes.String(),
        Serdes.String()
    )
);
// this doesn't work because I only get doThingsBeforeStateUpdate called after the state is updated by the DeviceStateAggregator
KTable<String, DeviceState> deviceTable = deviceEventKStream
    .groupByKey(Serialized.with(Serdes.String(), new deviceEventSerde()))
    .aggregate(
        () -> null,
        new DeviceStateAggregator(),
        Materialized.<String, DeviceState>as(stateStoreSupplier)
            .withValueSerde(deviceStateSerde)
    );
deviceEventKStream.join(deviceTable, (event, state) -> doThingsBeforeStateUpdate(event, state));
I was hoping to be able to exploit the Streams DSL to check some preconditions before the state is updated by the aggregator, but it doesn't seem possible. I'm currently exploring the idea of using a Processor, or perhaps just extending my DeviceStateAggregator to do all the pre-aggregation processing as well, but that feels awkward to me, as it forces the aggregation to care about concerns that don't seem reasonable to do as part of the aggregation.
that is, I want to, say, compare an event to the current state, and if they match a certain predicate, do some processing, and then update the state.
If I understand your question and notably this quote correctly, then I'd follow your idea to use the Processor API to implement this. You will need to implement a Transformer (as you want it to output data, not just read it).
As an example application that you could use as a starting point, I'd recommend looking at the MixAndMatch DSL + Processor API and the CustomStreamTableJoin examples at https://github.com/confluentinc/kafka-streams-examples. The second example shows, though for a different use case, how to do custom "if this then that" logic when working with state in the Processor API, and it also covers join functionality, which is something you want to do, too.
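Very roughly, and only as a sketch (the store name "device-state-store", the DeviceEvent type, the matchesPredicate() check, and the aggregate() helper are placeholders; the store itself would be registered via StreamsBuilder#addStateStore and connected by passing its name to transform()), the Transformer could look like this:

deviceEventKStream
    .transform(() -> new Transformer<String, DeviceEvent, KeyValue<String, DeviceState>>() {
        private KeyValueStore<String, DeviceState> stateStore;

        @Override
        @SuppressWarnings("unchecked")
        public void init(final ProcessorContext context) {
            stateStore = (KeyValueStore<String, DeviceState>) context.getStateStore("device-state-store");
        }

        @Override
        public KeyValue<String, DeviceState> transform(final String deviceId, final DeviceEvent event) {
            final DeviceState currentState = stateStore.get(deviceId);    // state *before* this event is applied
            if (matchesPredicate(event, currentState)) {                  // placeholder predicate
                doThingsBeforeStateUpdate(event, currentState);           // your pre-update processing
            }
            final DeviceState newState = aggregate(currentState, event);  // placeholder for your aggregation logic
            stateStore.put(deviceId, newState);
            return KeyValue.pair(deviceId, newState);
        }

        @Override
        public void close() {}
    }, "device-state-store");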
Hope this helps!

What is the behaviour of ProcessorContext.getStateStore(String name) & ReadOnlyKeyValueStore.get(String key) in Kafka Streams?

I have a 1.0.0 Kafka Streams application with two classes, as described in How to evaluate consuming time in kafka stream application. In my application, I read the events, perform some conditional checks, and forward them to another topic on the same Kafka cluster. During my evaluation, I fetch some expressions from Kafka with the help of a global table store. I observed that most of the time is taken while getting the value from the store (sample code is below).
Is the data read only once from Kafka and maintained in a local store?
or
Is it read from Kafka whenever we call the org.apache.kafka.streams.state.ReadOnlyKeyValueStore.get(String key) API? If yes, then how do I maintain a local store instead of reading from Kafka every time?
Please help.
Ex:
private KeyValueStore<String, List<String>> policyStore = (KeyValueStore<String, List<String>>) this.context
        .getStateStore(policyGlobalTableName);
List<String> policyIds = policyStore.get(event.getCustomerCode());
By default, stores use an application-local RocksDB instance to buffer data. Thus, if you query the store with a get(), it will not go over the network to the brokers, but will only hit the local RocksDB instance.
You can try to change the RocksDB settings to improve the performance, but I have no guidelines atm on which configs you might want to change. Configuring RocksDB is quite a tricky thing, but you might want to search the Internet for further information about it.
You can pass in RocksDB configs via StreamsConfig (cf. https://docs.confluent.io/current/streams/developer-guide/config-streams.html#rocksdb-config-setter)
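For example, a minimal (illustrative, not tuned) config setter could look like this; the block cache size is just a placeholder value:

import java.util.Map;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;

public class CustomRocksDBConfig implements RocksDBConfigSetter {
    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        final BlockBasedTableConfig tableConfig = new BlockBasedTableConfig();
        tableConfig.setBlockCacheSize(32 * 1024 * 1024L); // illustrative value only
        options.setTableFormatConfig(tableConfig);
    }
}

// registered via:
// props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, CustomRocksDBConfig.class);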
As an alternative, you could also try to reconfigure Streams to use in-memory stores instead of RocksDB. Note, that this will increase your rebalance time, as there is no local buffered state if you use in-memory instead of RocksDB. (cf. https://docs.confluent.io/current/streams/developer-guide/processor-api.html#defining-and-creating-a-state-store)
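For example (reusing the names from your snippet; the serde for the List<String> values is assumed), the global table could be backed by an in-memory store along these lines:

GlobalKTable<String, List<String>> policyTable = streamsBuilder.globalTable(
    policyTopic,                                                        // assumed topic name
    Materialized.<String, List<String>>as(Stores.inMemoryKeyValueStore(policyGlobalTableName))
        .withKeySerde(Serdes.String())
        .withValueSerde(policyListSerde));                              // assumed serde for List<String>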

CQRS+ES: Client log as event

I'm developing a small CQRS+ES framework and developing applications with it. In my system, I need to log some client actions and use them for analytics, statistics, and maybe in the future do something in the domain with them. For example, a client (on the web) downloads some resource(s) and I need to save the date, time, type (download, partial, ...), region or country (maybe IP), etc. After that, in some view, the client can see the download count or some complex report. I'm not sure how to implement this feature.
The first solution creates an analytics context and some aggregate; on each client action I send a command like IncreaseDownloadCounter(resource), then handle the command, raise domain events, and update the view. But in this scenario the download has already occurred before I send the command, so it is not really a command, and on the other side version conflicts increase.
The second solution is raising an event from the client side and updating the view model based on it, but with this type of handling my event is not stored in the event store, because it is not raised by a command and never changes any domain context. And if it is stored in the event store, there is no aggregate to handle it after fetching it for some other use.
The third solution is raising an event from the client side and storing it in another database, maybe with a special table for each type of event. But with this manner of event handling I have multiple event stores with different schemas, and it is difficult to recreate view models and trace events for recreating context states, so if in the future I add some domain that uses this type of event, it is difficult to use the events.
What is the best approach and solution for this scenario?
The first solution creates an analytics context and some aggregate
Unquestionably the wrong answer; the event has already happened, so it is too late for the domain model to complain.
What you have is a stream of events. Putting them in the same event store that you use for your aggregate event streams is fine. Putting them in a separate store is also fine. So you are going to need some other constraint to make a good choice.
Typically, reads vastly outnumber writes, so one concern might be that these events are going to saturate the domain store. That might push you towards storing these events separately from your data model (prior art: we typically keep the business data in our persistent book of record, but the sequence of http requests received by the server is typically written instead to a log...)
If you are supporting an operational view, push on the requirement that the state be recovered after a restart. You might be able to get by with building your view off of an in memory model of the event counts, and use something more practical for the representations of the events.
Thanks for your complete answer. So I should create something like the ES schema without some fields (aggregate name or type, version, etc.) and collect the client events in that repository; some offline process then reads them and updates the read model, or creates commands to do something in the domain space.
Something like that, yes. If the view for the client doesn't actually require any validation by your model at all, then building the read model from the externally provided events is fine.
Are you recommending saving some claim or authorization token of the user and the sender app for validation in another process?
Maybe, maybe not. The token describes the authority of the event; our own event handler is the authority for the command(s) that is/are derived from the events. It's an interesting question that probably requires more context -- I'd suggest you open a new question on that point.
