Can a ValueTransformer in Kafka Streams have access to tombstones? - apache-kafka-streams

Is it possible to have access to the tombstones that an upstream KTable processes?
Say I would like to apply logic to every deletion that happens on an upstream KTable; that is, does a ValueTransformer get forwarded delete events?
kTable.transform(deleteEvent)
If not, is there another way to achieve this?

Related

What is the most efficient way to know that a Kafka event is visible in a K-Table?

We use Kafka topics as both events and a repository. Using the kafka-streams API we define a simple K-Table that represents all the events in the topic.
In our use case we publish events to the topic and subsequently reference the K-Table as the backing repository. The main issue is that the published events are not immediately visible on the K-Table.
We tried transactions and exactly once semantics as described here (https://kafka.apache.org/26/documentation/streams/core-concepts#streams_processing_guarantee) but there is always a delay we cannot control.
1. Publish event
2. Undetermined amount of time passes
3. Published event is visible in the K-Table
Is there a way to eliminate the delay, or otherwise know that a specific event has been consumed by the K-Table?
NOTE: We tried both partition and global tables with similar results.
Thanks
Because Kafka is an asynchronous system the observed delay is expected and you cannot do anything to avoid it.
However, if you publish a message to a topic, the KafkaProducer allows you to pass in a Callback to the send() method and the callback will be executed after the message was written to the topic providing the record's metadata like topic, partition, and offset.
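For illustration, a minimal sketch of such a callback, assuming String keys and values; the topic name and broker address are made up:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PublishWithCallback {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "some-key", "some-value"),
                    (metadata, exception) -> {
                        if (exception == null) {
                            // The broker acknowledged the write; remember this offset to later
                            // check whether the KTable has caught up to it.
                            System.out.printf("written to %s-%d at offset %d%n",
                                    metadata.topic(), metadata.partition(), metadata.offset());
                        } else {
                            exception.printStackTrace();
                        }
                    });
        }
    }
}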
After Kafka Streams has processed messages, it will eventually commit the offsets (you can configure the commit interval, too). Thus, you can know that the message is in the KTable after the offset was committed. By default, committing happens only every 30 seconds, and it's not recommended to use a very short commit interval because it implies large overhead. Thus, I am not sure if this would help for your case, as it seems you want a more timely "response".
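If you do want to tune it anyway, the commit interval is an ordinary Streams config; the 1-second value in this fragment is purely illustrative and comes with the overhead mentioned above:

Properties streamsProps = new Properties();
streamsProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
streamsProps.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Commit (and thereby make progress observable) every second instead of every 30 seconds.
streamsProps.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000);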
As an alternative, you can also disable caching on the KTable and use a toStream().process() step -- after each update to the KTable, the changelog stream provided by toStream() will contain the record, and you can access the record metadata (including its offset) in the Processor via the given ProcessorContext object. This should also allow you to figure out when the record is available in the KTable.
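A rough sketch of that second approach, using the Processor API as it existed in the 2.6-era DSL; topic, store name, and serdes are assumptions for the example:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class TableUpdateObserver {

    public static Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        // Build the KTable with caching disabled so every single update reaches toStream().
        KTable<String, String> table = builder.table("events",
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("events-store")
                        .withKeySerde(Serdes.String())
                        .withValueSerde(Serdes.String())
                        .withCachingDisabled());

        table.toStream().process(() -> new Processor<String, String>() {
            private ProcessorContext context;

            @Override
            public void init(ProcessorContext context) {
                this.context = context;
            }

            @Override
            public void process(String key, String value) {
                // context.offset() is the offset of the input record behind this table
                // update; once it reaches the offset reported by the producer callback,
                // the event is visible in the KTable.
                System.out.printf("table updated from %s-%d at offset %d%n",
                        context.topic(), context.partition(), context.offset());
            }

            @Override
            public void close() { }
        });

        return builder.build();
    }
}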

Kafka Streams: How to avoid forwarding downstream twice when repartitioning

In my application I have KafkaStreams instances with a very simple topology: there is one processor, with a key-value store, and each incoming message gets written to the store and is then forwarded downstream to a sink.
I would like to increase the number of partitions for my source topic and then reprocess the data, so that each store will contain only keys relevant to its partition. (I understand this is done using the Application Reset Tool.) However, while reprocessing the data, I don't want to forward anything downstream; I want only new data to be forwarded. (Otherwise, consumers of the result topic would handle old values again.) My question: is there an easy way to achieve this? Maybe a built-in mechanism that can help me tell reprocessed data and new data apart?
Thank you in advance
There is no built-in mechanism. But you might be able to just remove the sink operation that writes to the result topic while you reprocess your data -- when reprocessing is done, you stop the application, add the sink again, and restart. Not sure if this works for you.
Another possible solution might be to use a transform() and implement an offset-based filter. For each input topic partition, you get the offset of the first new message (this is something you need to do manually before you write the Transformer). You use this information to implement the filter as a custom Transformer: for each input record, you check the record's partition and offset and drop it if the record's offset is smaller than the offset of the first new message of that partition.
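A sketch of such an offset-based filter (class and variable names are made up; the offset of the first new message per input partition is assumed to have been determined beforehand and passed into the constructor):

import java.util.Map;

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

public class SkipReprocessedTransformer
        implements Transformer<String, String, KeyValue<String, String>> {

    // partition -> offset of the first "new" message in that partition
    private final Map<Integer, Long> firstNewOffsets;
    private ProcessorContext context;

    public SkipReprocessedTransformer(Map<Integer, Long> firstNewOffsets) {
        this.firstNewOffsets = firstNewOffsets;
    }

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        long firstNewOffset = firstNewOffsets.getOrDefault(context.partition(), 0L);
        if (context.offset() < firstNewOffset) {
            return null;                      // reprocessed record: drop it
        }
        return KeyValue.pair(key, value);     // new record: forward downstream
    }

    @Override
    public void close() { }
}

It would be wired in before the sink with something like stream.transform(() -> new SkipReprocessedTransformer(firstNewOffsets)).to("result-topic"), and since records at or beyond the first new offsets always pass through, it can simply stay in place after reprocessing is done.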

CQRS + Microservices Handling event rollback

We are using microservices, CQRS, and an event store (using the Node.js cqrs-domain package). Everything works like a charm, and the typical flow goes like:
1. REST
2. Service
3. Command validation
4. Command
5. Aggregate
6. Event
7. Event store (transactional data)
8. Returns aggregate with aggregate ID
9. Store in microservice local DB (essentially the read DB)
10. Publish event to the queue
The problem with the flow above is that the transactional save (i.e. persistence to the event store) and the write to the microservice's read DB happen in different transaction contexts. If there is a failure at step 9, how should I handle the event that has already been written to the event store and the aggregate that has already been updated?
Any suggestions would be highly appreciated.
You retry it later.
The "book of record" is the event store. The downstream views (the "published events", the read models) are derived from the book of record. They are typically behind the book of record in time (eventual consistency) and are not typically synchronized with each other.
So you might have, at some point in time, 105 events written to the book of record, but only 100 published to the queue, and a representation in your service database constructed from only 98.
Updating a view is typically done in one of two ways. You can, of course, start with a brand new representation and replay all of the events into it as part of each update. Alternatively, you track in the metadata of the view how far along in the event history you have already gotten, and use that information to determine where the next read of the event history begins.
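As a sketch of the second way (all types here are illustrative stand-ins, not part of any particular framework), the view keeps a checkpoint of the last event it applied and resumes reading the event history from there on each update:

import java.util.List;

class ReadModelUpdater {

    interface Event { long position(); Object payload(); }
    interface EventStore { List<Event> readFrom(long fromPosition); }
    interface ReadModel {
        long loadCheckpoint();                  // position of the last applied event
        void apply(Event event);                // project the event into the view
        void saveCheckpoint(long position);
    }

    private final EventStore eventStore;
    private final ReadModel readModel;

    ReadModelUpdater(EventStore eventStore, ReadModel readModel) {
        this.eventStore = eventStore;
        this.readModel = readModel;
    }

    void update() {
        long position = readModel.loadCheckpoint();
        for (Event event : eventStore.readFrom(position + 1)) {
            readModel.apply(event);
            position = event.position();
            readModel.saveCheckpoint(position);
        }
    }
}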
Inside your event store, you could track whether read-side replication was successful.
As soon as step 9 succeeds, you can flag the event as 'replicated'.
That way, you could introduce a component watching for unreplicated events and trigger step 9. You could also track whether the replication failed multiple times.
Updating the read side (step 9) and flagging an event as replicated should happen consistently. You could use a saga pattern here.
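A sketch of such a watcher (again, all names are illustrative): it scans the event store for events not yet flagged as replicated, retries step 9 for each, and flags the event only after the read side was updated successfully.

import java.util.List;

class ReplicationWatcher {

    interface StoredEvent { String id(); Object payload(); }
    interface FlaggedEventStore {
        List<StoredEvent> findUnreplicated();
        void markReplicated(String eventId);
    }
    interface ReadSide { void apply(StoredEvent event); }

    private final FlaggedEventStore eventStore;
    private final ReadSide readSide;

    ReplicationWatcher(FlaggedEventStore eventStore, ReadSide readSide) {
        this.eventStore = eventStore;
        this.readSide = readSide;
    }

    void runOnce() {
        for (StoredEvent event : eventStore.findUnreplicated()) {
            try {
                readSide.apply(event);                  // retry step 9
                eventStore.markReplicated(event.id());  // flag on success
            } catch (Exception e) {
                // Leave the event unflagged; it will be retried on the next run.
            }
        }
    }
}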
I think I now understand it better.
The aggregate would still be created; the point is that all validations for any kind of consistency should happen before the aggregate is constructed. What needs handling is the failure, beyond the purview of that code, that occurs while updating the microservice's read-side DB.
So in the ideal case the aggregate is created, but the associated event remains undispatched until all the read-side dependencies are updated; if they are not, it stays undispatched and can be handled separately.
The event store will still have all the events, and eventual consistency is maintained this way.

@KafkaListener should pull new data only when a certain condition is met; if the condition fails, pulling of data should stop until the condition is met

The use case I am working on is that a message received by a @KafkaListener triggers an async method. I want this async method to finish and only then receive a new message from the Kafka topic. Any ideas or suggestions regarding this implementation? Can Kafka support such a scenario?
e.g.
while (asyncMethod.idle()) {
    @KafkaListener(topics = "topic")
    public void listen(String message) {
        process(message);
        asyncMethod.execute();
    }
}
I am confused by this question, but it sounds like you would want to make this synchronous vs. asynchronous?
Either that or you could implement a lock to basically make sure that it doesn't listen unless the lock is false and set the lock to true once it has received a message.
You may want to work on your implementation/architecture though, Kafka shouldn't be used to maintain order or block that way.
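For what it's worth, a sketch of the synchronous option with Spring for Apache Kafka (topic, group, and the AsyncService type are made up for the example): blocking the listener thread until the async work completes means the container only polls the next record after listen() returns.

import java.util.concurrent.CompletableFuture;

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class BlockingListener {

    public interface AsyncService {
        CompletableFuture<Void> execute(String message);
    }

    private final AsyncService asyncService;

    public BlockingListener(AsyncService asyncService) {
        this.asyncService = asyncService;
    }

    @KafkaListener(topics = "my-topic", groupId = "my-group")
    public void listen(String message) throws Exception {
        // Wait for the async work; if it can take longer than max.poll.interval.ms
        // (5 minutes by default), that setting needs to be raised accordingly.
        asyncService.execute(message).get();
    }
}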

Tokbox- don't let the same user publish twice

If a user is publishing to a Tokbox session and, for any reason, that same user logs in on a different device or re-opens the session in another browser window, I want to stop the second one from publishing.
Luckily, on the metadata for the streams, I am saving the user id, so when there is a list of streams it's easy to see if an existing stream belongs to the user that is logged in.
When a publisher gets initialized the following happens:
Listen for session.on("streamCreated") when this happens, subscribe to new streams
Start publishing
The problem is, when the session gets initialized, there is no way to inspect the current streams of the session to see if this user is already publishing. We don't know what the streams are until the on("streamCreated") callback fires.
I have a hunch that there is an easy solution that I am missing. Any ideas?
I assume that when you said you save the user ID on the stream metadata, that means when you initialize the Publisher, you set the "name" property. That's a great technique.
My idea is slightly hacky, but it's the best I can come up with right now. I would solve this problem by essentially breaking up the subscription of streams into 2 phases:
all streams created before this client connection
all streams created after
During #1 I would check each stream's "name" property to see if it belongs to the user at this client connection. If it does, then you know they are entering the session twice and you can set a flag (let's call it "userRejoining"). In order to know that #1 is complete, I would set a timer (this is why I call it a hack) for a reasonable amount of time, such as 1 second, each time a "streamCreated" event arrives, and remove any previous timer.
Then, if the "userRejoining" flag is not set, the Publisher is initialized and published to the session.
During #2, you just subscribe to any stream that is created.
The downside is that you've now delayed your user experience of publishing by ~1 second everywhere. In larger group scenarios this could be a deal breaker, but in smaller (1:1) types of sessions this should be acceptable. I hope this explanation is clear, and if not I can try to write some sample code for you.
