How to handle a legal enforced data delete request in an event sourced system? - event-sourcing

In an event sourced system, historic data in the form of events is never thrown away. Doing so could result in a corrupted state. Now imagine there is a court ruling, stating some data needs to be deleted (for example, search engines had to delete privacy specific data). How would you achieve this?

That's a really good question.
So far, I've learned of two possibilities.
Easy part first: if you are using event sourcing, then all of your views of your data should be derivable from the events in your event store. Therefore, all of the data that you have stored for reading (caches, screens, projections, reports) can be blown away and regenerated after you scrub the tainted data from the event store.
So you only need to figure out that part.
First, if the tainted data never gets into the store, you don't have to worry about scrubbing it out. For instance, sensitive information can be isolated in a key value store; references to that data in the event store are always by surrogate key. When you need to scrub, the data in the key value store is nuked, you have a bunch of events that point to something no longer readable, and you just need to ensure that your read models can continue to function if the referenced data is not available.
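A minimal sketch of the surrogate-key idea, assuming in-memory stand-ins for both stores. The names `sensitive_store`, `event_store`, `register_customer`, `scrub`, and `customer_view` are hypothetical and only illustrate the shape of the approach:

```python
import uuid

# Hypothetical stores: a mutable key-value store for sensitive data,
# and an append-only list standing in for the event store.
sensitive_store = {}
event_store = []

def register_customer(name, email):
    """Keep personal data outside the event store; the event carries only a surrogate key."""
    key = str(uuid.uuid4())
    sensitive_store[key] = {"name": name, "email": email}
    event_store.append({"type": "CustomerRegistered", "personal_data_key": key})
    return key

def scrub(key):
    """Honor a delete request: remove the data, leave the events untouched."""
    sensitive_store.pop(key, None)

def customer_view():
    """Read model that keeps working when the referenced data is gone."""
    view = []
    for event in event_store:
        if event["type"] == "CustomerRegistered":
            data = sensitive_store.get(event["personal_data_key"])
            view.append(data["name"] if data else "[deleted]")
    return view

alice_key = register_customer("Alice", "alice@example.com")
register_customer("Bob", "bob@example.com")
scrub(alice_key)
print(customer_view())  # ['[deleted]', 'Bob']
```

The event history is never rewritten here; only the side store changes, and the read model decides how to present a dangling reference.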
If the data does need to get into the event store -- because it's needed to maintain the integrity of the domain model -- then the idea of "aggregates" may be able to help.
Aggregates are an idea taken from DDD: your domain can be decomposed into elements that don't need to share data directly. One aggregate never references data within another directly; instead you use indirect references by ID, the ID itself being another surrogate key.
Since these aggregates are isolated from each other, they can have their own event history. In that case you can scrub the tainted data by simply eliminating any aggregates that have been contaminated. You just delete the event streams.
A response like this doesn't put you in a corrupted state, just an inconsistent one. Everything still runs, there's just a bunch of data missing.
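As a rough illustration of deleting a contaminated aggregate's stream and regenerating a view afterwards, here is a sketch that assumes one list of events per aggregate. The `streams` layout and the event names are invented for the example:

```python
# Hypothetical event store: one list of events per aggregate stream.
streams = {
    "account-1": [{"type": "Deposited", "amount": 100}],
    "account-2": [{"type": "Deposited", "amount": 50},
                  {"type": "Withdrawn", "amount": 20}],
}

def rebuild_balances(streams):
    """Regenerate the read model from scratch from whatever streams remain."""
    balances = {}
    for stream_id, events in streams.items():
        total = 0
        for event in events:
            if event["type"] == "Deposited":
                total += event["amount"]
            elif event["type"] == "Withdrawn":
                total -= event["amount"]
        balances[stream_id] = total
    return balances

def scrub_aggregate(streams, stream_id):
    """Delete the entire event stream of a contaminated aggregate."""
    streams.pop(stream_id, None)

before = rebuild_balances(streams)   # both accounts present
scrub_aggregate(streams, "account-2")
after = rebuild_balances(streams)    # only account-1 remains
```

Any other aggregate that still holds the ID "account-2" now points at nothing, which is the inconsistency (rather than corruption) described above.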
There's also the weapon of a "compensating event" available in the toolkit; you might be able to introduce a new stream of events that brings the system back to a consistent state. For example, if scrubbing a bunch of transactions takes the books out of balance, you may be able to publish an event that creates a charge against iCouldTellYouButThen....

Related

Making sure you don't add same person data twice using EventSourcing

I am wondering how you make sure you are not adding the same person twice in your EventStore?
Let's say that in your application you add person data, but you want to make sure that the same person name and birthday are not added twice in different streams.
Do you ask your ReadModels, or do you do it within your EventStore?
The generalized form of the problem that you are trying to solve is set validation.
Step #1 is to push back really hard on the requirement to ensure that the data is always unique - if it doesn't have to be unique always, then you can use a detect and correct approach. See Memories, Guesses, and Apologies by Pat Helland. Roughly translated, you do the best you can with the information you have, and back up if it turns out you have to revert an error.
If a uniqueness violation would expose you to unacceptable risk (for instance, getting sued to bankruptcy because the duplication violated government mandated privacy requirements), then you have work to do.
To validate set uniqueness you need to lock the entire set; this lock could be pessimistic or optimistic in implementation. That's relatively straight forward when the entire set is stored in one place (which is to say, under a single lock), but something of a nightmare when the set is distributed (aka multiple databases).
If your set is an aggregate (meaning that the members of the set are being treated as a single whole for purposes of update), then the mechanics of DDD are straightforward. Load the set into memory from the "repository", make changes to the set, persist the changes.
This design is fine with event sourcing where each aggregate has a single stream -- you guard against races by locking "the" stream.
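A sketch of the single-stream design with an optimistic lock, assuming a toy in-memory stream. `InMemoryStream`, `ConcurrencyError`, and `register_person` are hypothetical names, not the API of any particular event store:

```python
class ConcurrencyError(Exception):
    """Raised when the stream changed between load and append."""

class InMemoryStream:
    """Hypothetical single stream with an optimistic version check on append."""
    def __init__(self):
        self.events = []

    def append(self, event, expected_version):
        # The append only succeeds if nobody else wrote since we loaded.
        if expected_version != len(self.events):
            raise ConcurrencyError("stream changed since it was loaded")
        self.events.append(event)

def register_person(stream, name, birthday):
    # Load the entire set into memory by replaying "the" stream.
    version = len(stream.events)
    members = {(e["name"], e["birthday"]) for e in stream.events}
    if (name, birthday) in members:
        raise ValueError("person already registered")
    stream.append(
        {"type": "PersonRegistered", "name": name, "birthday": birthday},
        expected_version=version,
    )

people = InMemoryStream()
register_person(people, "Bob", "1990-01-01")
```

A second writer that loaded the stream at the old version would get a `ConcurrencyError` on append and would have to reload and re-check.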
Most people don't want this design, because the members of the set are big, and for most data you need only a tiny slice of that data, so loading/storing the entire set in working memory is wasteful.
So what they do instead is move the responsibility for maintaining the uniqueness property from the domain model to the storage. RDBMS solutions are really good at sets. You define the constraint that maintains the property, and the database ensures that no writes which violate the constraint are permitted.
If your event store is a relational database, you can do the same thing -- the event stream and the table maintaining your set invariant are updated together within the same transaction.
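For illustration, a small sketch using Python's built-in `sqlite3` module as the relational store. The table names `events` and `person_index` and the function `register_person` are assumptions made for the example:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, stream_id TEXT, payload TEXT)")
# The UNIQUE constraint is what maintains the set invariant.
conn.execute("CREATE TABLE person_index (name TEXT, birthday TEXT, UNIQUE (name, birthday))")

def register_person(conn, stream_id, name, birthday):
    """Append the event and claim the (name, birthday) pair in one transaction."""
    payload = json.dumps({"type": "PersonRegistered", "name": name, "birthday": birthday})
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO events (stream_id, payload) VALUES (?, ?)",
                (stream_id, payload),
            )
            conn.execute(
                "INSERT INTO person_index (name, birthday) VALUES (?, ?)",
                (name, birthday),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate: the event insert was rolled back too

first = register_person(conn, "person-1", "Bob", "1990-01-01")
second = register_person(conn, "person-2", "Bob", "1990-01-01")
event_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Because both inserts share a transaction, a rejected duplicate leaves no orphaned event behind.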
If your event store isn't a relational database? Well, again, you have to look at money -- if the risk is high enough, then you have to discard plumbing that doesn't let you solve the problem with plumbing that does.
In some cases, there is another approach: encoding the information that needs to be unique into the stream identifier. The stream comes to represent "All users named Bob", and then your domain model can make sure that the Bob stream contains at most one active user at a time.
Then you start needing to think about whether the name Bob is stable, and which trade-offs you are willing to make when an unstable name changes.
Names of people is a particularly miserable problem, because none of the things we believe about names are true. So you get all of the usual problems with uniqueness, dialed up to eleven.
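The stream-identifier approach could look roughly like the sketch below. It simplifies the rule to "the stream must not already exist", and the normalization (trim and lowercase) is an assumption; deciding what counts as the same name is exactly the stability question raised above. All names here are hypothetical:

```python
import hashlib

class StreamExistsError(Exception):
    """Raised when a stream for the same unique values already exists."""

streams = {}  # hypothetical event store: stream id -> list of events

def stream_id_for(name, birthday):
    """Derive a deterministic stream id from the values that must be unique."""
    normalized = f"{name.strip().lower()}|{birthday}"
    return "person-" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def register_person(name, birthday):
    stream_id = stream_id_for(name, birthday)
    # A real store would need to make this check-and-create atomic.
    if stream_id in streams:
        raise StreamExistsError(stream_id)
    streams[stream_id] = [
        {"type": "PersonRegistered", "name": name, "birthday": birthday}
    ]
    return stream_id

first_id = register_person("Bob", "1990-01-01")
try:
    register_person(" bob ", "1990-01-01")
    duplicate_rejected = False
except StreamExistsError:
    duplicate_rejected = True
```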
If you are going to validate this kind of thing, then it should be done in the aggregate itself IMO, and you'd have to use read models for that, like you say. But you end up with infrastructure code/dependencies being sent into your aggregates or passed into your methods.
In this case I'd suggest creating a read model of Person.Id, Person.Name, Person.Birthday and then instead of creating a Person directly, create some service which uses the read model table to look up whether or not a row exists and either give you that aggregate back or create a new one and give that back. Then you won't need to validate at all, so long as all Person-creation is done via this service.
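A sketch of such a lookup-or-create service, assuming the read model is a plain dictionary keyed by name and birthday. `PersonService` and `get_or_create` are hypothetical names:

```python
import uuid

class PersonService:
    """Hypothetical service: the single entry point for creating Person aggregates."""

    def __init__(self):
        self.read_model = {}  # (name, birthday) -> person id
        self.events = []      # stand-in for the event store

    def get_or_create(self, name, birthday):
        key = (name, birthday)
        person_id = self.read_model.get(key)
        if person_id is None:
            person_id = str(uuid.uuid4())
            self.events.append({
                "type": "PersonCreated",
                "id": person_id,
                "name": name,
                "birthday": birthday,
            })
            # Updated inline here; a real projection may lag behind the events.
            self.read_model[key] = person_id
        return person_id

service = PersonService()
first = service.get_or_create("Bob", "1990-01-01")
again = service.get_or_create("Bob", "1990-01-01")
```

If the read model is only eventually consistent, two concurrent calls can both miss the lookup, so this narrows the duplicate window rather than closing it.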

In micro service with Event sourcing, should i save command or event into service db or should it be one big db?

As the question says, what's the best practice for storing commands and events?
Should I store only commands, since commands will generate the events?
Storing only "commands" works in some settings. For example, if you review what the team at LMAX was sharing about their designs, you'll see that what they were writing into their journals were the input messages.
In their context, they didn't need to worry that the underlying domain model of the process was going to change (that would happen during the daily maintenance window, when everything was quiet), so there was never any question what the state of the system would be after a given sequence of events.
But event-sourcing is normally understood to mean saving a representation of the state of the system -- just that instead of overwriting our data structure, we are extending it (think linked list of changes). The changes we persist tend to be the observable effects of the inputs, rather than the raw inputs alone.
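To make the distinction concrete, here is a small sketch in which commands are the raw inputs and the persisted events are their observable effects. The account example, `handle_withdraw`, and the event names are invented for illustration:

```python
def handle_withdraw(balance, command):
    """Decide the observable effect of a command under the current rules."""
    if command["amount"] <= balance:
        return [{"type": "Withdrawn", "amount": command["amount"]}]
    return [{"type": "WithdrawalRejected",
             "amount": command["amount"],
             "reason": "insufficient funds"}]

def apply(balance, event):
    """Evolve the state from a recorded event; no business rules here."""
    if event["type"] == "Withdrawn":
        return balance - event["amount"]
    return balance

event_log = []  # what gets persisted: effects, not inputs
balance = 100
commands = [{"type": "Withdraw", "amount": 70},
            {"type": "Withdraw", "amount": 50}]
for command in commands:
    events = handle_withdraw(balance, command)
    event_log.extend(events)
    for event in events:
        balance = apply(balance, event)
```

Replaying `event_log` gives the same balance even if the rule inside `handle_withdraw` changes later, whereas replaying only the commands would not.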
In an event sourcing architecture, events are the source of your state. So you need to store events: the facts recorded in your system.

What does data look like when using Event Sourcing?

I'm trying to understand how Event Sourcing changes the data architecture of a service. I've been doing a lot of research, but I can't seem to understand how data is supposed to be properly stored with event sourcing.
Let's say I have a service that keeps track of vehicles transporting packages. The current non relational structure for the data model is that each document represents a vehicle, and has many fields representing origin location, destination location, types of packages, amount of packages, status of the vehicle, etc. Normally this gets queried for information to be read to the front end. When changes are made by the user, the appropriate changes are made to this document in order to update this.
With event sourcing, it seems that a snapshot of every event is stored, but there seem to be a few ways to interpret that:
The first is that the multiple versions of the document I described exist, each a new snapshot every time a change is made. Each event would create a new version of this document and alter it. This is the easiest way for me to wrap my head around it, but I believe this to be incorrect.
Another interpretation I have is that each event stores SPECIFIC information about what's been altered in the document. When the vehicle status changes from On Road to Available, for example, an event specifically for vehicle status changes is triggered. Let's say it's called VehicleStatusUpdatedEvent, and contains the Vehicle ID number, the new status, and the timestamp for this event. So this event is stored and is published to a messaging queue. When picked up from the queue, the appropriate changes are made to the current version of the document. I can understand this, but I think I still have some misconceptions here. My understanding is that event sourcing allows us to have a snapshot of data upon each change, so we can know what it looks like at any point. What I just described would keep a log of changes, but still only have one version of the file, as the events only contain specific pieces of the whole file.
Can someone describe how the data flow and architecture works with event sourcing? Using the vehicle data example I provided might help me frame it better. I feel that I am close to understanding this, but I am missing some fundamental pieces that I can't seem to understand by searching online.
The current non relational structure for the data model is that each document represents a vehicle
OK, let's start from there.
In the data model you've described, storage of a document destroys the earlier copy.
Now imagine that instead we were storing the document in a git repository. Then saving the document would also save metadata, and that metadata would include a pointer to the previous document.
Of course, we've probably got a lot of duplication in that case. So instead of storing the complete document every time, we'll store a patch document (think JSON Patch), and metadata pointing to the previous patch.
Take that same idea again, but instead of storing generic patch documents, we use domain specific messages that describe what is going on in terms of the model.
That's what the data model of an event sourced entity looks like: a list of domain specific descriptions of document transformations.
When you need to reconstitute the current state, you start with a state you know (which could be the "null" state of the document before anything happened to it), and replay onto that document all of the patches (events) that have occurred since.
If you want to do a temporal query, the game is the same, you replay the events up to the point in time that you are interested in.
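A sketch of replay and a temporal query for the vehicle example. Only the status-update event comes from the question; the other event types, the field names, and the integer timestamps are assumptions made for illustration:

```python
def apply(document, event):
    """Return a new document with one domain event applied."""
    updated = dict(document)
    if event["type"] == "VehicleRegistered":
        updated.update(vehicle_id=event["vehicle_id"], status="Available", packages=0)
    elif event["type"] == "PackagesLoaded":
        updated["packages"] = updated["packages"] + event["count"]
    elif event["type"] == "VehicleStatusUpdated":
        updated["status"] = event["status"]
    return updated

def replay(events, until=None):
    """Start from the empty ("null") document and apply events in order."""
    document = {}
    for event in events:
        if until is not None and event["timestamp"] > until:
            break  # temporal query: ignore everything after the cutoff
        document = apply(document, event)
    return document

history = [
    {"type": "VehicleRegistered", "vehicle_id": "V1", "timestamp": 1},
    {"type": "PackagesLoaded", "count": 12, "timestamp": 2},
    {"type": "VehicleStatusUpdated", "status": "On Road", "timestamp": 3},
    {"type": "VehicleStatusUpdated", "status": "Available", "timestamp": 4},
]

current = replay(history)             # state after all events
as_of_3 = replay(history, until=3)    # state as it was at timestamp 3
```

Only `history` is stored; both documents are derived on demand.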
So essentially when referring to an older build, you reconstruct the document using the events, correct?
Yes, that's exactly right.
So is there still a "current status" document or is that considered bad practice?
"It depends". In the general case, there is no current status document; only the write-ordered list of events is "real", and everything else is derived from that.
Conversations about event sourcing often lead to consideration of dedicated message stores for managing persistence of those ordered lists, and it is common that the message stores do not also support document storage. So trying to keep a "current version" around would require commits to two different stores.
At this point, designers typically either decide that "recent version" is good enough, in which case they build eventually consistent representations of documents outside of the transaction boundary... OR they decide current version is important, and look into storage solutions that support storing the current version in the same transaction as the events (ex: using an RDBMS).
what is the procedure used to generate the snapshot you want using the events?
IF you want to generate a snapshot, then you'll normally end up using a pattern called a projection, to iterate over the events and either fold or reduce them to create the document.
Roughly, you have a function somewhere that looks like
document-with-meta-data = projection(event-history-with-metadata)
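One way such a function might look in Python, written as a fold with `functools.reduce`. The event shape (a `data` patch plus `meta` with `sequence` and `timestamp`) is an assumption, and for brevity each event is treated as a simple field patch rather than a domain-specific message:

```python
from functools import reduce

def step(doc_with_meta, event_with_meta):
    """Fold one event (with its metadata) into the document (with its metadata)."""
    document, _meta = doc_with_meta
    new_document = {**document, **event_with_meta["data"]}
    new_meta = {
        "version": event_with_meta["meta"]["sequence"],
        "as_of": event_with_meta["meta"]["timestamp"],
    }
    return new_document, new_meta

def projection(event_history_with_metadata):
    """document-with-meta-data = projection(event-history-with-metadata)"""
    initial = ({}, {"version": 0, "as_of": None})
    return reduce(step, event_history_with_metadata, initial)

history = [
    {"data": {"vehicle_id": "V1", "status": "On Road"},
     "meta": {"sequence": 1, "timestamp": 10}},
    {"data": {"status": "Available"},
     "meta": {"sequence": 2, "timestamp": 20}},
]

document, meta = projection(history)
```

The metadata on the result records which event the snapshot is current as of, which is what lets a later rebuild resume from that point instead of from the beginning.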

(why) is FSCTL_SET_OBJECT_ID dangerous?

NTFS files can have object ids. These ids can be set using FSCTL_SET_OBJECT_ID. However, the msdn article says:
Modifying an object identifier can result in the loss of data from portions of a file, up to and including entire volumes of data.
But it doesn't go into any more detail. How can this result in loss of data? Is it talking about potential object id collisions in the file system, and does NTFS rely on them in some way?
Side note: I did some experimenting with this before I found that paragraph, and set the object IDs of some newly created files. Here's hoping that my file system is still intact.
I really don't think this can directly result in loss of data.
The only way I can imagine it being possible is if e.g. a backup program assumes that (1) every file has an Object Id, and (2) that the program is keeping track of all IDs at all times. In that case it might assume that an ID that is not in its database must refer to a file that should not exist, and it might delete the file.
Yeah, I know it sounds ridiculous, but that's the only way I can think of in which this might happen. I don't think you can lose data just by changing IDs.
Object identifiers are used by the distributed link tracking service, which enables client applications to track link sources that have moved. The link tracking service maintains its link to an object only by using these object identifiers (IDs).
So coming back to your question,
Is it talking about potential object id collisions in the file system?
I don't think so. Windows does provide the option to set object IDs using FSCTL_SET_OBJECT_ID, but that doesn't bring the risk of an ID collision.
Attempting to set an object identifier on an object that already has an object identifier will fail.
.. and does NTFS rely on them in some way?
Yes. Object identifiers are used to track files and directories. An index of all object IDs is stored on the volume. Rename, backup, and restore operations preserve object IDs. However, copy operations do not preserve object IDs, because that would violate their uniqueness.
How can this result in loss of data?
You won't get into serious trouble if you change (or rather set) the object ID of user-created files, as you did. However, if a user (knowingly or unknowingly) sets an object ID used by a shared object file or library, the change will not be reflected as is.
Since Windows doesn't want anyone but developers to play with crucial library files, it issues a generic warning:
Modifying an object identifier can result in the loss of data from portions of a file, up to and including entire volumes of data.
Bottom line: Change it if you know what you are doing.
There's another MSDN article on distributed link tracking and object identifiers.
Hope it helps!
EDIT:
Thanks to @Mehrdad for pointing this out. I didn't mean the object identifiers of the DLLs themselves, but the ones they use internally.
OLEACC (a DLL) provides the Active Accessibility runtime and manages requests from Active Accessibility clients [source]. It uses the OBJID_QUERYCLASSNAMEIDX object identifier [source].

Using Core Data as cache

I am using Core Data for its storage features. At some point I make external API calls that require me to update the local object graph. My current (dumb) plan is to clear out all instances of old NSManagedObjects (regardless of whether they have been updated) and replace them with their new equivalents -- a trump merge policy of sorts.
I feel like there is a better way to do this. I have unique identifiers from the server, so I should be able to match them to my objects in the store. Is there a way to do this without manually fetching objects from the context by their identifiers and resetting each property? Is there a way for me to just create a completely new context, regenerate the object graph, and just give it to Core Data to merge based on their unique identifiers?
Your strategy of matching, based on the server's unique IDs, is a good approach. Hopefully you can get your server to deliver only the objects that have changed since the time of your last update (which you will keep track of, and provide in the server call).
In order to update the Core Data objects, though, you will have to fetch them, instantiate the NSManagedObjects, make the changes, and save them. You can do this all in a background thread (child context, performBlock:), but you'll still have to round-trip your objects into memory and back to store. Doing it in a child context and its own thread will keep your UI snappy, but you'll still have to do the processing.
Another idea: In the last day or so I've been reading about AFIncrementalStore, an NSIncrementalStore implementation which uses AFNetworking to provide Core Data properties on demand, caching locally. I haven't built anything with it yet but it looks pretty slick. It sounds like your project might be a good use of this library. Code is on GitHub: https://github.com/AFNetworking/AFIncrementalStore.
