Kafka Streams - custom SessionWindows suppression with session size constraints - session

I am currently working on a Kafka Streams solution for retrieving user browsing sessions using the SessionWindows. My topology looks like:
builder
.stream(...)
.map(... => (newKey, value))
.groupByKey(...)
.windowedBy(SessionWindows.`with`(INACTIVITY_GAP).grace(GRACE))
.aggregate(... into list of events)
.suppress(Suppressed.untilWindowCloses(unbounded()))
This simple scenario works well for me, however I need to put some additional checks to the suppression logic. Namely I would like to force-flush all the sessions that exceed given size (e.g. all the sessions that have more than 1000 events inside, the events were produced within inactivity gap). My question is how this could be implemented?
I know that the .suppress() method does not accept any custom implementation of Suppressed. Therefore I was thinking about replacing the .suppress() with .transform() with my custom Transformer with a SessionStore inside that could do the suppression logic and also apply these additional checks. However, I am having a hard time when it comes to adding/deleting entries to the store and implementing the basic "untilWindowClosed" suppression by myself: I could probably do the periodic flush through ProcessorContext.schedule() but the SessionStore does not provide the possibility to iterate through all keys.
Is this a good direction? Are there any other ways to add size constraints to the sessions?

this might be what you're looking for:
org.apache.kafka.streams.kstream.Suppressed#untilTimeLimit
org.apache.kafka.streams.kstream.Suppressed.BufferConfig#maxRecords
.suppress(Suppressed.untilTimeLimit(Duration.ofSeconds(WAIT_UNTIL), Suppressed.BufferConfig.maxRecords(1000)))

Related

Is there any way to replay events in a date range?

I am implementing an example of spring-boot and axon. I have two events
(deposit and withdraw account balance). I want to know is there any way to get the state of the Account Aggregate by a given date ?
I want to get not just the final state, but to replay events in a range of dates.
I think I can help with this.
In the context of Axon Framework, you can start a replay of events by telling a given TrackingEventProcessor to 'reset' it's Tokens. By the way, the current description on this in the Reference Guide can be found here.
These TrackingTokens are the objects which know how far a given TrackingEventProcessor is in terms of handling events from the Event Stream. Thus resetting/adjusting these TrackingTokens is what will issue a Replay of events.
Knowing all these, the second step is to look at the methods the TrackingEventProcessor provides to 'reset tokens', which is threefold:
TrackingEventProcessor#resetTokens()
TrackingEventProcessor#resetTokens(Function<StreamableMessageSource, TrackingToken>)
TrackingEventProcessor#resetTokens(TrackingToken)
Option one will reset your tokens to the beginning of the event stream, which will thus replay everything.
Option two and three however give you the opportunity to provide a TrackingToken.
Thus, you could provide a TrackingToken starting from several points on the Event Stream. So, how do you go about to creating such a TrackingToken at a specific point in time? To that end, you should take a look at the StreamableMessageSource interface, which has the following operations:
StreamableMessageSource#createTailToken()
StreamableMessageSource#createHeadToken()
StreamableMessageSource#createTokenAt(Instant)
StreamableMessageSource#createTokenSince(Duration)
Option 1 is what's used to create a token at the start of the stream, whilst 2 will create a token at the head of the stream.
Option 3 and 4 will however allow you to create a token at a specific point in time, thus allowing you to replay all the events since the defined instance up to now.
There is one caveat in this scenario however. You're asking to replay an Aggregate. From Axon's perspective by default the Aggregate is the Command Model in a CQRS set up, thus dealing with Commands going in to your system. In the majority of the applications, you want Commands (e.g. the requests to change something) to occur on the current state of the application. As such, the Repository provided to retrieve an Aggregate does not allow specifying a point in time.
The above described solution in regards to replaying is thus solely tied to Query Model creation, as the TrackingEventProcessor is part of the Event Handling side in your application most often used to create views. This idea also ties in with your questions, that you want to know the "state of the Account Aggregate" at a given point in time. That's not a command, but a query, as you have 'a request for data' instead of 'the request to change state'.
Hope this helps you out #Safe!

Compensating Events on CQRS/ES Architecture

So, I'm working on a CQRS/ES project in which we are having some doubts about how to handle trivial problems that would be easy to handle in other architectures
My scenario is the following:
I have a customer CRUD REST API and each customer has unique document(number), so when I'm registering a new customer I have to verify if there is another customer with that document to avoid duplicity, but when it comes to a CQRS/ES architecture where we have eventual consistency, I found out that this kind of validations can be very hard to address.
It is important to notice that my problem is not across microservices, but between the command application and the query application of the same microservice.
Also we are using eventstore.
My current solution:
So what I do today is, in my command application, before saving the CustomerCreated event, I ask the query application (using PostgreSQL) if there is a customer with that document, and if not, I allow the event to go on. But that doesn't guarantee 100%, right? Because my query can be desynchronized, so I cannot trust it 100%. That's when my second validation kicks in, when my query application is processing the events and saving them to my PostgreSQL, I check again if there is a customer with that document and if there is, I reject that event and emit a compensating event to undo/cancel/inactivate the customer with the duplicated document, therefore finishing that customer stream on eventstore.
Altough this works, there are 2 things that bother me here, the first thing is my command application relying on the query application, so if my query application is down, my command is affected (today I just return false on my validation if query is down but still...) and second thing is, should a query/read model really be able to emit events? And if so, what is the correct way of doing it? Should the command have some kind of API for that? Or should the query emit the event directly to eventstore using some common shared library? And if I have more than one view/read? Which one should I choose to handle this?
Really hope someone could shine a light into these questions and help me this these matters.
For reference, you may want to be reviewing what Greg Young has written about Set Validation.
I ask the query application (using PostgreSQL) if there is a customer with that document, and if not, I allow the event to go on. But that doesn't guarantee 100%, right?
That's exactly right - your read model is stale copy, and may not have all of the information collected by the write model.
That's when my second validation kicks in, when my query application is processing the events and saving them to my PostgreSQL, I check again if there is a customer with that document and if there is, I reject that event and emit a compensating event to undo/cancel/inactivate the customer with the duplicated document, therefore finishing that customer stream on eventstore.
This spelling doesn't quite match the usual designs. The more common implementation is that, if we detect a problem when reading data, we send a command message to the write model, telling it to straighten things out.
This is commonly referred to as a process manager, but you can think of it as the automation of a human supervisor of the system. Conceptually, a process manager is an event sourced collection of messages to be sent to the command model.
You might also want to consider whether you are modeling your domain correctly. If documents are supposed to be unique, then maybe the command model should be using the document number as a key in the book of record, rather than using the customer. Or perhaps the document id should be a function of the customer data, rather than being an arbitrary input.
as far as I know, eventstore doesn't have transactions across different streams
Right - one of the things you really need to be thinking about in general is where your stream boundaries lie. If set validation has significant business value, then you really need to be thinking about getting the entire set into a single stream (or by finding a way to constrain uniqueness without using a set).
How should I send a command message to the write model? via API? via a message broker like Kafka?
That's plumbing; it doesn't really matter how you do it, so long as you are sure that the command runs within its own transaction/unit of work.
So what I do today is, in my command application, before saving the CustomerCreated event, I ask the query application (using PostgreSQL) if there is a customer with that document, and if not, I allow the event to go on. But that doesn't guarantee 100%, right? Because my query can be desynchronized, so I cannot trust it 100%.
No, you cannot safely rely on the query side, which is eventually consistent, to prevent the system to step into an invalid state.
You have two options:
You permit the system to enter in a temporary, pending state and then, eventually, you will bring it into a valid permanent state; for this you could allow the command to pass, yield CustomerRegistered event and using a Saga/Process manager you verify against a uniquely-indexed-by-document-collection and issue a compensating command (not event!), i.e. UnregisterCustomer.
Instead of sending a command, you create&start a Saga/Process that preallocates the document in a uniquely-indexed-by-document-collection and if successfully then send the RegisterCustomer command. You can model the Saga as an entity.
So, in both solution you use a Saga/Process manager. In order for the system to be resilient you should make sure that RegisterCustomer command is idempotent (so you can resend it if the Saga fails/is restarted)
You've butted up against a fairly common problem. I think the other answer by VoicOfUnreason is worth reading. I just wanted to make you aware of a few more options.
A simple approach I have used in the past is to create a lookup table. Your command tries to register the key in a unique constraint table. If it can reserve the key the command can go ahead.
Depending on the nature of the data and the domain you could let this 'problem' occur and raise additional events to mark it. If it is something that's important to the business/the way the application works then you can deal with it either manually or at the time via compensating commands. if the latter then it would make sense to use a process manager.
In some (rare) cases where speed/capacity is less of an issue then you could consider old-fashioned locking and transactions. Admittedly these are much better suited to CRUD style implementations but they can be used in CQRS/ES.
I have more detail on this in my blog post: How to Handle Set Based Consistency Validation in CQRS
I hope you find it helpful.

Is it ok to have FAT events with event sourcing?

I have recently been building an application on top of Greg Young EventStore as my peristance layer and I have been pondering how big should I allow an event to get?
For example I have an UK Address Aggregate with the following fields
UK_Address
-BuildingName
-Street
-Locality
-Town
-Postcode
Now I'm building the UI using React/Redux and was thinking should I create a single FAT addressUpdated Event contatining all the above fields?
Or should I Create a event for each of the different fields? and batch them within the client until the Save event is fired? buildingNameUpdated Event, streetUpdated Event, localityUpdated Event.
I'm not sure if the answer is as black and white ask I have asked it what I really would like to know is what conditions/constraints could you use to make the decision?
should I create a event for each of the different fields?
No. The representations of your events are part of the API -- so you want to use spellings that make sense at the level of the business, not at the level of the implementation.
Now I'm building the UI using React/Redux and was thinking should I create a single FAT updateAddress Event containing all the above fields?
You don't need to constrain the data that you send to your UI to match that which is in the persistence store. The UI is just a cached representation of a read model; there's no reason that representation needs to have the same form as what is in your event store.
Consider the React model itself -- your code makes changes to the "in memory" representation of your data, and then the library computes the new DOM and replaces it, which in turn causes the browser to update its view, which in turn causes the pixels on the screen to change.
So taking a fat event from the store, and breaking it into field level events for the UI is fine. Taking multiple events from the store and aggregating them into a single message for the UI is also fine. Taking events from the event store and transforming them into a spelling that the UI will recognize is also fine.
Do you have any comment regarding Arien answer regarding keeping fields that need to be consistent together? so regardless of when your snapshop the current state of the world it would be in a valid state?
I don't believe that this makes sense, and I'm not sure if it is possible in general.
It doesn't make sense, because "valid state" is a write model concern only; events are things that have happened, its too late to vote on whether they are valid or not. For instance, if you deploy a new model, with a new invariant, it still needs to respect the history of what happened before. So you can build a snapshot for that new model, but the snapshot may not be "valid". Too bad.
Given that, I don't think it makes sense to worry over whether each individual event in a commit leaves the snapshot in a valid state.
In particular, if a particular transaction involves multiple entities, it is very likely that the domain language will suggest an event for each entity (we "debit cash" and "credit accounts receivable"). The entities themselves, of course, are capable of changing independently of each other -- it's the aggregate that maintains the balance.
You have to bundle al the information together in one event when this data has to be consistent with each other.
So when you update one field of an address you probably get an unwanted address.
This will happen when the client has not processed all the events at a certain time due to eventual consistency.
Example:
Change address (City=1, Street=1, Housenumber=1) to (City=2, Street=2, Housenumber=2)
When you do this with 3 events and you have just processed one at the time of reading you could get the address: (City=2, Street=1, Housenumber=1).
If puzzled, give a try to a solution that is easier to implement. I guess "FAT" event will be easier: you will end up spending less time for implementing/debugging/supporting.
It is usually referred as YAGNI-KISS-Occam's Razor principles.
In theory and I find it to be a good rule of thumb is to have your commands and events reflecting the intent of the user staying true to DDD. You can find a good explanation of the pros and cons about event granularity here: https://medium.com/#hugo.oliveira.rocha/what-they-dont-tell-you-about-event-sourcing-6afc23c69e9a

How to update/migrate data when using CQRS and an EventStore?

So I'm currently diving the CQRS architecture along with the EventStore "pattern".
It opens applications to a new dimension of scalability and flexibility as well as testing.
However I'm still stuck on how to properly handle data migration.
Here is a concrete use case:
Let's say I want to manage a blog with articles and comments.
On the write side, I'm using MySQL, and on the read side ElasticSearch, now every time a I process a Command, I persist the data on the write side, dispatch an Event to persist the data on the read side.
Now lets say I've some sort of ViewModel called ArticleSummary which contains an id, and a title.
I've a new feature request, to include the article tags to my ArticleSummary, I would add some dictionary to my model to include the tags.
Given the tags did already exist in my write layer, I would need to update or use a new "table" to properly use the new included data.
I'm aware of the EventLog Replay strategy which consists in replaying all the events to "update" all the ViewModel, but, seriously, is it viable when we do have a billion of rows?
Is there any proven strategies? Any feedbacks?
I'm aware of the EventLog Replay strategy which consists in replaying
all the events to "update" all the ViewModel, but, seriously, is it
viable when we do have a billion of rows?
I would say "yes" :)
You are going to write a handler for the new summary feature that would update your query side anyway. So you already have the code. Writing special once-off migration code may not buy you all that much. I would go with migration code when you have to do an initial update of, say, a new system that requires some data transformation once off, but in this case your infrastructure would exist.
You would need to send only the relevant events to the new handler so you also wouldn't replay everything.
In any event, if you have a billion rows of data your servers would probably be able to handle the load :)
Im currently using the NEventStore by JOliver.
When we started, we were replaying our entire store back through our denormalizers/event handlers when the application started up.
We were initially keeping all our data in memory but knew this approach wouldn't be viable in the long term.
The approach we use currently is that we can replay an individual denormalizer, which makes things a lot faster since you aren't unnecessarily replaying events through denomalizers that haven't changed.
The trick we found though was that we needed another representation of our commits so we could query all the events that we handled by event type - a query that cannot be performed against the normal store.

ETL , Esper or Drools?

The question environment relates to JavaEE, Spring
I am developing a system which can start and stop arbitrary TCP (or other) listeners for incoming messages. There could be a need to authenticate these messages. These messages need to be parsed and stored in some other entities. These entities model which fields they store.
So for example if I have property1 that can have two text fields FillLevel1 and FillLevel2, I could receive messages on TCP which have both fill levels specified in text as F1=100;F2=90
Later I could add another filed say FillLevel3 when I start receiving messages F1=xx;F2=xx;F3=xx. But this is a conscious decision on the part of system modeler.
My question is what do you think is better to use for parsing and storing the message. ETL (using Pantaho, which is used in other system) where you store the raw message and use task executor to consume them one by one and store the transformed messages as per your rules.
One could use Espr or Drools to do the same thing , storing rules and executing them with timer, but I am not sure how dynamic you could get with making rules (they have to be made by end user in a running system and preferably in most user friendly way, ie no scripts or code, only GUI)
The end user should be capable of changing the parse rules. It is also possible that end user might want to change the archived data as well (for example in the above example if a new value of FillLevel is added, one would like to put a FillLevel=-99 in the previous values to make the data consistent).
Please ask for explanations, I have the feeling that I need to revise this question a bit.
Thanks
Well Esper is a great CEP engine, but drools has it's own implementation Drools Fusion which integrates really well with jBpm. That would be a good choice.

Resources