Consistent N1QL Query Couchbase GOCB sdk - go

I'm currently implementing EventSourcing for my Go Actor lib.
The problem that I have right now is that when an actor restarts and needs to replay all its state from the event journal, the query might return inconsistent data.
I know that I can solve this using MutationToken
But, if I do that, I would be forced to write all events in sequential order, that is, write the last event last.
That way the mutation token for the last event would be enough to get all the data consistently for the specific actor.
This is however very slow, writing about 10 000 events in order, takes about 5 sec on my setup.
If I instead write those 10 000 async, using goroutines, I can write all of the data in less than one sec.
But then the writes happen in nondeterministic order and I can't know which mutation token I can trust.
e.g. Event 999 might be written before Event 843 due to goroutine scheduling AFAIK.
What are my options here?

Technically speaking, MutationToken and asynchronous operations are not mutually exclusive. It may be possible without a change to the client (I'm not sure), but the key is to collect all the MutationToken responses and then issue the query with the highest sequence number per vbucket across all of them.
The key here is that given a single MutationToken, you can add the others to it. I don't directly see a way to do this, but since internally it's just a map it should be relatively straightforward and I'm sure we (Couchbase) would take a contribution that does this. At the lowest level, it's just a map of vbucket sequences that is provided to query at the time the query is issued.
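The merge described above can be sketched in a few lines. This is a minimal illustration, not the SDK's API: the `Token` type and `MergeTokens` function are hypothetical stand-ins for whatever the client exposes, and the sketch assumes all tokens for a vbucket share the same vbucket UUID (no failover in between).

```go
package main

import "fmt"

// Token is a hypothetical stand-in for an SDK mutation token: it records
// which vbucket a write landed in and that write's sequence number.
type Token struct {
	VbID   uint16 // vbucket (partition) id
	VbUUID uint64 // identifies the vbucket's history branch
	SeqNo  uint64 // sequence number of the mutation within the vbucket
}

// MergeTokens keeps, for every vbucket, only the token with the highest
// sequence number. The order in which tokens arrive does not matter.
func MergeTokens(tokens []Token) map[uint16]Token {
	merged := make(map[uint16]Token)
	for _, t := range tokens {
		cur, seen := merged[t.VbID]
		if !seen || t.SeqNo > cur.SeqNo {
			merged[t.VbID] = t
		}
	}
	return merged
}

func main() {
	// Tokens collected from concurrent writes, in arbitrary completion order.
	tokens := []Token{
		{VbID: 7, VbUUID: 100, SeqNo: 12},
		{VbID: 3, VbUUID: 200, SeqNo: 5},
		{VbID: 7, VbUUID: 100, SeqNo: 9},
	}
	for vb, t := range MergeTokens(tokens) {
		fmt.Printf("vbucket %d -> seqno %d\n", vb, t.SeqNo)
	}
}
```

Each goroutine would hand its token back (for example over a channel), and the merged map is what would be passed to the query as the consistency requirement.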

Related

CQRS Where to Query for business logic/Internal Processes

I'm currently looking at implementing CQRS driven by events (not yet event sourcing) for a service at work; the reasoning being:
I need aggregate data to support a REST API coming out of this service (which will be used to populate views); however, the aggregated data will not be used by the application logic/processing (i.e. the data originating outside this service; the bits of the aggregate originating within it will be used)
I need to stream events to other systems so that they can react to the data (will produce to a Kafka topic, so the 'read'/'projection' side of this system will consume the same events as the external systems, from these Kafka topics)
I will be consuming events from internal systems to help populate the aggregate for the views in the first point (i.e. it's data from this service and others)
The reason for not going event sourced currently is that a) we're in a bit of a time crunch, and b) due to still learning about it. Having said which, it is something that we are looking to do in the future- though currently, we have a static DB in the 'Command' side of the system, which will just store current state
I'm pretty confident with the concept of using the aggregate data to provide the Rest API; however my confusion is coming from when I want to change a resource from within the system (for example via a cron job triggered 5 times a day) Example:
If I have resource of class x, which (given some data), wants a piece of state changing
I need to select instances of the class x which meet the requirements (from one of the DBs). Think select * from {class x} where last_changed_date > 5 days ago;
Then create a command to change the state of these instances of x (in my case, the static command DB would be updated, as well as an event made to update the read DB)
The middle bullet point is what is confusing me. If I pull the data out of the Read DB, and check some information on it, then decide to change a property; I then have to convert the object from the 'Read Object' to the 'Command Object', so that I can then persist it and create an event? With my current architecture- I could query the command DB no problem, to find all the instances of {class x} that match the criteria, however I don't know if a) this is the right thing to do, and b) how this would work if I was using an event store as a DB? I'd have to query a table with millions of rows to find the most recent bit of state about the objects, to then see if they match?
Lots of what I read online has been very conceptual- so I think when it comes to implementations it maybe seems more difficult than it is? Anyhow, if anyone has any advice it would be hugely appreciated!
TIA :)
CQRS can be interpreted in a "permissive" way: rather than saying "thou shalt not query the command/write side", it says "it's OK to have a query/read side that's separate from the command/write side". Because you have this permission to do such separation, it follows that one can optimize the command/write side for a more write-heavy workload (in practice, there are always some reads in the command/write side: since command validation is typically done against some state, that requires some means of getting the state!). From this, it's extremely likely that there will be some queries which can be performed efficiently against the command/write side and some that can't be (without deoptimizing the command/write side). From this perspective, it's OK to perform the first kind of query against the command/write side: you get the benefit of strong consistency by doing that, though make sure that you're not affecting the command/write side's primary raison d'etre of taking writes.
Event sourcing is in many ways the maximally optimized persistence model for a command/write side, especially if you have some means of keeping the absolute latest state cached and ensuring concurrency control. This is because you can then have many times more writes than reads. The tradeoff in event sourcing is that nearly all reads become rather more expensive than in an update-in-place model: it's thus generally the case that CQRS doesn't force event sourcing but event sourcing tends to force CQRS (and in turn, event sourcing can simplify ensuring that a CQRS system is eventually consistent, which can be difficult to ensure with update-in-place).
In an event-sourced system, you would tend to have a read-side which subscribes to the event stream and tracks the mapping of X ID to last updated and which periodically queries and issues commands. Alternatively, you can have a scheduler service that lets you say "issue this command at this time, unless canceled or rescheduled before then" and a read-side which subscribes to updates and schedules a command for the given ID 5 days from now after canceling the command from the previous update.

KStreams: implementing session window with processor API

I need to implement logic similar to session windows using the processor API in order to have full control over the state store. Since the processor API doesn't provide a windowing abstraction, this needs to be done manually. However, I fail to find the source code for the KStreams session window logic to get some initial ideas (specifically regarding session timeouts).
I was expecting to use punctuate method, but it's a per processor timer rather than per key timer. Additionally SessionStore<K, AGG> doesn't provide an API to traverse the database for all keys.
[UPDATE]
As an example, assume a processor instance is processing K1 and stream time is incremented, which causes the session for K2 to time out. K2 may or may not exist at all. How do you know that there exists a specific key (like K2) when stream time is incremented (while processing a different key)? In other words, when stream time is incremented, how do you figure out which windows are expired (because you don't know that those keys exist)?
This is the DSL code: https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KStreamSessionWindowAggregate.java -- hope it helps.
It's unclear what your question is though -- it's mostly statements. So let me try to give some general answer.
In the DSL, sessions are closed based on "stream time" progress. Only relying on the input data makes the operation deterministic. Using wall-clock time would introduce non-determinism. Hence, using a Punctuation is not necessary in the DSL implementation.
"Additionally SessionStore<K, AGG> doesn't provide an API to traverse the database for all keys."
Sessions in the DSL are based on keys and thus it's sufficient to scan the store on a per-key basis over a time range (as done via findSessions(...)).
Update:
In the DSL, each time a session window is updated, a corresponding update event is sent downstream immediately. Hence, the DSL implementation does not wait for "stream time" to advance any further but publishes the current (potentially intermediate) result right away.
To obey the grace period, the record timestamp is compared to "stream time" and if the corresponding session window is already closed, the record is skipped (cf. https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KStreamSessionWindowAggregate.java#L146). I.e., closing a window is just a logical step (not an actual operation); the session will still be stored, and if a window is closed no additional event needs to be sent downstream because the final result was already sent downstream in the last update to the window.
Retention time itself does not need to be handled by the Processor implementation because it's a built-in feature of the SessionStore: internally, the session store maintains so-called "segments" that store sessions for a certain time period. Each time a put() is done, the store checks if old segments can be dropped (based on the timestamp provided by put()). I.e., old sessions are deleted lazily and as bulk deletes (i.e., all sessions of the whole segment are deleted at once) as that is more efficient than individual deletes.
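To make the "closing is only a comparison" idea concrete, here is a small sketch of per-key session merging with a lateness check. It is written in Go for illustration and only loosely follows the behaviour described above; it is not the Kafka Streams source, and `Store`, `Session` and `Process` are hypothetical names for an in-memory stand-in.

```go
package main

import "fmt"

// Session covers the record timestamps [Start, End] (ms) seen for one key;
// Count is the aggregate.
type Session struct {
	Start, End int64
	Count      int
}

// Store is a hypothetical in-memory stand-in for a session store.
type Store struct {
	Gap, Grace int64                // inactivity gap and grace period (ms)
	StreamTime int64                // highest record timestamp seen so far
	Sessions   map[string][]Session // sessions per key
}

// Process merges the record into the sessions of its key and returns the
// updated session. ok is false when the record is too late and was skipped.
func (s *Store) Process(key string, ts int64) (Session, bool) {
	if ts > s.StreamTime {
		s.StreamTime = ts // stream time only moves forward
	}
	merged := Session{Start: ts, End: ts, Count: 1}
	var keep []Session
	for _, old := range s.Sessions[key] { // only this key is scanned
		if old.End+s.Gap >= ts && old.Start-s.Gap <= ts {
			if old.Start < merged.Start {
				merged.Start = old.Start
			}
			if old.End > merged.End {
				merged.End = old.End
			}
			merged.Count += old.Count
		} else {
			keep = append(keep, old)
		}
	}
	// "Closing" is just this comparison: no timer, no scan over other keys.
	if merged.End < s.StreamTime-s.Gap-s.Grace {
		return Session{}, false
	}
	s.Sessions[key] = append(keep, merged)
	return merged, true
}

func main() {
	s := &Store{Gap: 10, Grace: 0, Sessions: map[string][]Session{}}
	fmt.Println(s.Process("K2", 100)) // opens a session for K2
	fmt.Println(s.Process("K1", 200)) // advances stream time; K2 is untouched
	fmt.Println(s.Process("K2", 105)) // too late now, so it is skipped
}
```

In the K1/K2 scenario from the question, nothing has to discover K2 when stream time advances: K2's session is only looked at again when a record for K2 arrives, and that record is then either merged or skipped.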

How to get data without polling?

This is more of a theoretical question.
Well, imagine that I have two programs that work simultaneously. The main one only does something when it receives a flag set to true from a secondary program. So this main program has a function that keeps asking the secondary one for the value of the flag, and when it gets true, it does something.
What I learned at college is that polling is the simplest way of doing that. But when I started working as a developer, coworkers told me that this method generates some overhead, or that it's a waste of computation, because it asks for a value every certain amount of time.
I tried to come up with some ideas for doing this in a different way and searched the internet for something like this, but didn't find a useful way to do it.
I read about interrupts and passive approaches in which the main program gets the data only when it is informed by the secondary program. But how does this happen? The main program will need a function to check for the interrupt, right? So won't it end up the same way as before?
What could I do differently?
There is no magic...
No program will guess when it has new information to read. What you can do is decide between two approaches:
A -> asks -> B
A <- is informed <- B
When to use each? It depends on many factors, such as:
1- How fast does the data need to be delivered from the moment it is generated? As fast as possible, or can it be held for a while and accumulated?
2- How fast is the data generated?
3- How many simultaneous clients are requesting data from the same server?
4- What type of data are you dealing with? Persistent? Fast-changing?
If you are building something like a stocks analyzer where you need to ask for the price of stocks every second (and it also changes every second), the approach you mentioned may be the best.
If you are writing a chat-based app like WhatsApp, where you need to check whether there is a new message for the client and most of the time there won't be, publish/subscribe may be the best.
But all of this is a very superficial look at a high-impact architecture decision; it is not possible to find the best approach by looking at just one factor.
What I want to show is that "coworkers told me that this method generate some overhead or it's waste of computation" is not a correct statement. It may be true in some particular scenarios, but overhead will always exist in distributed systems.
The typical way to prevent polling is by using the Publish/Subscribe pattern.
Your client program will subscribe to the server program and when an event occurs, the server program will publish to all its subscribers for them to handle however they need to.
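As a rough in-process illustration of that pattern, here is a sketch in Go using channels. The `Flag` type and its methods are hypothetical names; between two separate programs the same shape would sit on top of a socket, pipe or message broker instead of a channel.

```go
package main

import (
	"fmt"
	"sync"
)

// Flag is a hypothetical publisher owned by the secondary program.
type Flag struct {
	mu   sync.Mutex
	subs []chan bool
}

// Subscribe returns a channel that receives every published value.
func (f *Flag) Subscribe() <-chan bool {
	ch := make(chan bool, 1)
	f.mu.Lock()
	f.subs = append(f.subs, ch)
	f.mu.Unlock()
	return ch
}

// Publish delivers v to all subscribers.
func (f *Flag) Publish(v bool) {
	f.mu.Lock()
	defer f.mu.Unlock()
	for _, ch := range f.subs {
		ch <- v
	}
}

func main() {
	flag := &Flag{}
	updates := flag.Subscribe()
	done := make(chan struct{})

	// The "main program": it blocks on the channel instead of polling.
	go func() {
		for v := range updates {
			if v {
				fmt.Println("flag is true, doing the work")
				close(done)
				return
			}
		}
	}()

	flag.Publish(true) // the "secondary program" announces the change
	<-done
}
```

The subscriber still "waits", but a blocked channel receive parks the goroutine until a value arrives, so no CPU is spent asking repeatedly.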
If you flip the order of the requests you end up with something more similar to a standard web API. Your main program (left in your example) would be a server listening for requests. The secondary program would be a client hitting an endpoint on the server to trigger an event.
There are many ways to accomplish this in every language, and it doesn't have to be tied to TCP/IP requests.
I'll add a few links for you shortly.
Well, in most languages you won't implement this at such a low level. But theoretically speaking, there are different waiting strategies, and what you are describing is active waiting. Doing this, you can easily eat all your CPU time.
Most languages provide libraries that let you start a process as a service which waits passively and is triggered when a request comes in.

Compensating Events on CQRS/ES Architecture

So, I'm working on a CQRS/ES project in which we are having some doubts about how to handle trivial problems that would be easy to handle in other architectures
My scenario is the following:
I have a customer CRUD REST API and each customer has unique document(number), so when I'm registering a new customer I have to verify if there is another customer with that document to avoid duplicity, but when it comes to a CQRS/ES architecture where we have eventual consistency, I found out that this kind of validations can be very hard to address.
It is important to notice that my problem is not across microservices, but between the command application and the query application of the same microservice.
Also we are using eventstore.
My current solution:
So what I do today is, in my command application, before saving the CustomerCreated event, I ask the query application (using PostgreSQL) if there is a customer with that document, and if not, I allow the event to go on. But that doesn't guarantee 100%, right? Because my query can be desynchronized, so I cannot trust it 100%. That's when my second validation kicks in, when my query application is processing the events and saving them to my PostgreSQL, I check again if there is a customer with that document and if there is, I reject that event and emit a compensating event to undo/cancel/inactivate the customer with the duplicated document, therefore finishing that customer stream on eventstore.
Although this works, there are 2 things that bother me here. The first is my command application relying on the query application, so if my query application is down, my command is affected (today I just return false on my validation if query is down, but still...). The second is: should a query/read model really be able to emit events? And if so, what is the correct way of doing it? Should the command have some kind of API for that? Or should the query emit the event directly to eventstore using some common shared library? And what if I have more than one view/read? Which one should I choose to handle this?
Really hope someone could shine a light on these questions and help me with these matters.
For reference, you may want to be reviewing what Greg Young has written about Set Validation.
"I ask the query application (using PostgreSQL) if there is a customer with that document, and if not, I allow the event to go on. But that doesn't guarantee 100%, right?"
That's exactly right - your read model is a stale copy, and may not have all of the information collected by the write model.
"That's when my second validation kicks in, when my query application is processing the events and saving them to my PostgreSQL, I check again if there is a customer with that document and if there is, I reject that event and emit a compensating event to undo/cancel/inactivate the customer with the duplicated document, therefore finishing that customer stream on eventstore."
This spelling doesn't quite match the usual designs. The more common implementation is that, if we detect a problem when reading data, we send a command message to the write model, telling it to straighten things out.
This is commonly referred to as a process manager, but you can think of it as the automation of a human supervisor of the system. Conceptually, a process manager is an event sourced collection of messages to be sent to the command model.
You might also want to consider whether you are modeling your domain correctly. If documents are supposed to be unique, then maybe the command model should be using the document number as a key in the book of record, rather than using the customer. Or perhaps the document id should be a function of the customer data, rather than being an arbitrary input.
"as far as I know, eventstore doesn't have transactions across different streams"
Right - one of the things you really need to be thinking about in general is where your stream boundaries lie. If set validation has significant business value, then you really need to be thinking about getting the entire set into a single stream (or by finding a way to constrain uniqueness without using a set).
"How should I send a command message to the write model? via API? via a message broker like Kafka?"
That's plumbing; it doesn't really matter how you do it, so long as you are sure that the command runs within its own transaction/unit of work.
"So what I do today is, in my command application, before saving the CustomerCreated event, I ask the query application (using PostgreSQL) if there is a customer with that document, and if not, I allow the event to go on. But that doesn't guarantee 100%, right? Because my query can be desynchronized, so I cannot trust it 100%."
No, you cannot safely rely on the query side, which is eventually consistent, to prevent the system from stepping into an invalid state.
You have two options:
You permit the system to enter a temporary, pending state and then, eventually, you bring it into a valid permanent state; for this you could allow the command to pass, yield a CustomerRegistered event and, using a Saga/Process manager, verify against a uniquely-indexed-by-document collection and issue a compensating command (not event!), i.e. UnregisterCustomer.
Instead of sending a command, you create and start a Saga/Process that preallocates the document in a uniquely-indexed-by-document collection and, if successful, then sends the RegisterCustomer command. You can model the Saga as an entity.
So, in both solutions you use a Saga/Process manager. In order for the system to be resilient you should make sure that the RegisterCustomer command is idempotent (so you can resend it if the Saga fails/is restarted).
You've butted up against a fairly common problem. I think the other answer by VoiceOfUnreason is worth reading. I just wanted to make you aware of a few more options.
A simple approach I have used in the past is to create a lookup table. Your command tries to register the key in a unique constraint table. If it can reserve the key the command can go ahead.
Depending on the nature of the data and the domain you could let this 'problem' occur and raise additional events to mark it. If it is something that's important to the business/the way the application works then you can deal with it either manually or at the time via compensating commands. If the latter, then it would make sense to use a process manager.
In some (rare) cases where speed/capacity is less of an issue then you could consider old-fashioned locking and transactions. Admittedly these are much better suited to CRUD style implementations but they can be used in CQRS/ES.
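The lookup-table option from the list above can be sketched as follows. This is an in-memory illustration only: `DocumentIndex` is a hypothetical stand-in for a table with a unique constraint on the document number, and `appendEvent` is a placeholder for the write to the event store.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrDuplicate is returned when another customer already holds the document.
var ErrDuplicate = errors.New("document already reserved")

// DocumentIndex is a hypothetical in-memory stand-in for a table with a
// unique constraint on the document number.
type DocumentIndex struct {
	mu    sync.Mutex
	owner map[string]string // document number -> customer id
}

func NewDocumentIndex() *DocumentIndex {
	return &DocumentIndex{owner: make(map[string]string)}
}

// Reserve claims document for customerID. Repeating the same reservation
// succeeds, so a retried command stays idempotent.
func (d *DocumentIndex) Reserve(document, customerID string) error {
	d.mu.Lock()
	defer d.mu.Unlock()
	if cur, taken := d.owner[document]; taken && cur != customerID {
		return ErrDuplicate
	}
	d.owner[document] = customerID
	return nil
}

// RegisterCustomer reserves the document first and appends the event only
// if the reservation succeeded.
func RegisterCustomer(idx *DocumentIndex, customerID, document string, appendEvent func(event string) error) error {
	if err := idx.Reserve(document, customerID); err != nil {
		return err
	}
	return appendEvent("CustomerRegistered:" + customerID)
}

func main() {
	idx := NewDocumentIndex()
	store := func(event string) error {
		fmt.Println("appended", event)
		return nil
	}
	fmt.Println(RegisterCustomer(idx, "c1", "123", store))
	fmt.Println(RegisterCustomer(idx, "c2", "123", store)) // rejected as a duplicate
}
```

Because the reservation is checked on the write side, the command no longer depends on the eventually consistent read model being up to date or even available.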
I have more detail on this in my blog post: How to Handle Set Based Consistency Validation in CQRS
I hope you find it helpful.

Auto save performance for rdbms

In my app user types in some content which I would like to auto save as the user types. The save call is not for every keystroke, rather I do autosave only when user pauses for more than 200ms. So in a typical paragraph there are 15-20 server calls. The content will not be read very often, so I need to optimize the writes.
I have to save data on MSSQL Server because of legacy code reasons. I'm getting 10 seconds avg response time in my load test. How do I improve the performance?
One approach I'm considering is, instead of directly saving data in mssql, to save it in Cassandra or redis, then eventually (maybe at regular time intervals) write it to mssql.
Another approach is instead of doing frequent updates, I'll insert new record for each auto save. Then a background process will clean up all records except for latest, every few minutes.
Update:
I replaced the existing logic with simple update calls to 2 tables and now I am seeing improvements. There was a long stored procedure which was taking up to 10 seconds under load. So for now I have a handle on the problem. Still, I would like to know if there is something I can do on the application server layer to reduce frequent DB calls.
It is quite hard to answer your question directly but here are some hints based on what we do in a multiple active user situation.
If you are writing/triggering on every keystroke, pass the keystroke to a background thread and do not perform the database write, or any network call, while blocking the user's typing. A fast typist can hit 20 keystrokes/second, and you cannot afford to introduce latency.
If recording on a web page, you might be able to use localStorage. Do not issue an AJAX style call on every keystroke as there is a limit to outstanding requests. You need to implement some kind of buffered send. Remember that network calls in the real world can take on the order of 300 ms just to traverse the network.
Do you really need to save every keystroke, or is every N seconds acceptable? Every save operation will eventually turn into a disk operation, so you really want to coalesce as many saves as possible. The quickest way to do something is not to do it at all.
If you are recording to a database, then it is often quicker to update an existing row, if you can fetch it by direct key first. Unfortunately it can sometimes be quicker to insert a new row and clean up excess later. This tends to be true if the table has few indexes. Which is quicker depends on the database engine in use and how it is being used. We use both methods.
When using a database keep in mind that they often keep journals of some kind, so if you are updating frequently you might create a large load on the journal files.
If you are using techniques (Using C terminology) like fopen, fwrite these can perform very well, but if you are worried about system failure recovery, you may need to call fsync, which then limits your maximum performance rate. If you need fsync, a database might be better.
You might like to consider writing to a transaction log table very frequently, and then posting to the real storage every N seconds. For example, if I am typing a customer's name I might record every keystroke into a keylog table, and then have a background job read the keylog table and transfer the data to the customers table. This helps reduce the operations on the customers table while also allowing the keylog table to be optimised for recording keystrokes, but at the cost of more code server side.
Overall, you want logic like this:
On keyup handler
    Add keystroke to background queue
    Wake background thread
Background thread
    Read/remove ALL data from background queue
    If no data, wait for wakeup and repeat
    Write to database/network/file etc. as one operation (this can now be synchronous calls)
    Optionally some velocity control; a simple one is sleep(50 ms) or sleep(2 s)
    Repeat
Keep in mind with the above the user can type and immediately hit close, so your final buffer write might not have flushed yet. You need to handle this.
If you get this correct, the user will not notice any delay. In our usage, we are recording around 1000 keystrokes/sec on average, all of which are routed over private networks to central points. This load is barely a blip; even network monitoring does not see such a small amount of traffic.
Good luck.

Resources