Filter out duplicates and only generate events when something has changed

We have a system that sends us events whenever any attribute has changed on an order. The source has a full order structure with, say, 500 attributes, and we only store a subset of them, say 50 attributes, plus some transformation into our internal order data model. Whenever any of these 500 attributes changes on the order in the source system, it triggers an event to us (note that they share the complete order state every time). Most of the time nothing changes in the 50 attributes we keep in our model, so many of these are duplicate events for us. Currently we are storing all the events we receive from the source in a database.
The requirement now is that we have to generate events for other consumers when the order data in our system has changed. So we will have to eliminate the duplicates from the source and only share an event when some attribute in our model has changed.
Possible solution:
For every event we receive from the source, we could compare it with our internal data to see if anything has changed on "any" attribute. We could use an in-memory database like Redis, but there is going to be a lot of I/O on the database. I'm also not sure what the best way is to check whether two JSON objects are exactly the same.
The other solution is that, instead of storing all attributes, we could store only some important "metadata" and compare on that metadata only. That would be good for performance, but it won't be fully correct if we have to generate an event for any attribute change.
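One way to keep the comparison cheap (sketched below; the attribute names, payload shape, and the in-memory map standing in for Redis are all assumptions) is to project the incoming payload onto the ~50 attributes we keep, serialise that subset with sorted keys, and hash it, so only a small fingerprint per order has to be stored and compared:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

public class OrderChangeDetector {

    // Stand-in for Redis: orderId -> fingerprint of the attributes we care about.
    private final Map<String, String> lastSeenFingerprint = new ConcurrentHashMap<>();

    // The ~50 attributes our model cares about (names assumed for illustration).
    private static final String[] RELEVANT_ATTRIBUTES = {"status", "quantity", "price", "currency"};

    /** Returns true if the attributes we care about changed since the last event for this order. */
    public boolean hasRelevantChange(String orderId, Map<String, String> fullSourcePayload) {
        String fingerprint = fingerprintOf(fullSourcePayload);
        String previous = lastSeenFingerprint.put(orderId, fingerprint);
        return previous == null || !previous.equals(fingerprint);
    }

    /** Canonical (sorted-key) serialisation of the relevant subset, then SHA-256. */
    private static String fingerprintOf(Map<String, String> payload) {
        TreeMap<String, String> subset = new TreeMap<>();
        for (String attribute : RELEVANT_ATTRIBUTES) {
            subset.put(attribute, payload.getOrDefault(attribute, ""));
        }
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(subset.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With Redis the fingerprint would live under the order id, so the per-event I/O is one small key per order rather than the whole order document.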
I would like to know how you would design this scenario.
Thanks.

Related

Transform data before adding to cache system or after when reading it

I have this situation and I don't know which option would fit better.
I have a solution where I search for soccer players. I only have their names and teams, but when a user comes to my website and clicks on a player, I show detailed information about the player that I get from various external providers (usually chosen by country).
I know which external provider to use when a call is made, and I pay the external providers each time I fetch data. To mitigate this, I try to fetch as few times as possible: I fetch once a user clicks on the player info, and next time, if it's in my database cache, I show the cached info. After 10 days I fetch again for that specific player from the external provider, as I want the info to be somewhat up to date.
I need to transform the different providers' data, which usually comes as JSON, into my own structure so I can handle it the right way. I have my own object structure, so the fields coming from the external providers always map to the same naming and structure in my code.
So, my problem is deciding when I should map/transform the data coming from the providers.
Option 1: I fetch data from the provider, transform it into my JSON structure, and store it in the database cache once, already in my main structure. Then, every time a user clicks on the soccer player details, I read this JSON field from the database cache and convert it directly into an object I know how to use.
Option 2: I fetch data from the provider and keep it as-is in my database cache. Then, every time someone clicks to get the soccer player's detailed info, I read the JSON record from the database cache, transform it into my naming and structure, and convert it into an object.
Notes:
- this is a cache database, records won't be kept forever; if during a call I see the record is more than 10 days old, I will fetch fresh data from the appropriate external provider
Deciding at which layer to cache data is an art form all its own. The higher the layer at which you cache data, the more performant it will be (less reprocessing needed), but the lower the re-use potential (different parts of the application may share the same cache, and they only find value in it if it hasn't been transformed too much).
Yours is another case of this. If you store it as the provider provides it, and you later need to change the way you transform it, you won't have to pay to re-retrieve it. If, on the other hand, you store it as you need it now, you may have to discard it all if you decide to change the transformation method.
Like all architectural design decisions, it's all about trade-offs. You have to decide what is more important to you and your application.
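To make the trade-off concrete, here is a rough sketch of the second option (store the provider payload as-is, transform on read) with the 10-day expiry described above. All the names here (PlayerDetailsService, fetchFromProvider, the in-memory map standing in for the cache database) are invented for illustration:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PlayerDetailsService {

    private static final Duration MAX_AGE = Duration.ofDays(10);

    /** Cached raw provider payload plus the moment it was fetched. */
    record CachedEntry(String rawProviderJson, Instant fetchedAt) {}

    /** The application's own structure, whatever that ends up being. */
    record PlayerDetails(String name, String team) {}

    // Stand-in for the cache database: playerId -> raw payload.
    private final Map<String, CachedEntry> cache = new ConcurrentHashMap<>();

    public PlayerDetails getPlayerDetails(String playerId) {
        CachedEntry entry = cache.get(playerId);
        if (entry == null || entry.fetchedAt().isBefore(Instant.now().minus(MAX_AGE))) {
            // Paid call to the external provider only when the cache misses or is stale.
            String raw = fetchFromProvider(playerId);
            entry = new CachedEntry(raw, Instant.now());
            cache.put(playerId, entry);
        }
        // Transformation happens on every read; changing the mapping never forces a re-fetch.
        return mapToOwnStructure(entry.rawProviderJson());
    }

    private String fetchFromProvider(String playerId) {
        return "{\"fullName\":\"...\",\"club\":\"...\"}"; // placeholder for the real provider call
    }

    private PlayerDetails mapToOwnStructure(String rawProviderJson) {
        return new PlayerDetails("...", "..."); // placeholder for the real mapping
    }
}
```

Transform-on-write is the mirror image: apply mapToOwnStructure before the cache.put and store the mapped result instead of the raw payload; reads get cheaper, but changing the mapping later means re-fetching (and re-paying).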

Addressing CRUD "tables" in event sourcing

I'm starting down an ES journey and want to know whether traditional support tables should be stored in the event log or handled differently. These tables would typically have a CRUD page. In other words, would it be common to have two approaches in the same application, one for support tables and one for transactional data?
A support table would be like "Account" in an accounting application or "Product Type" or the actual "Product" table in an ERP application (I'm not writing an ERP application - that's an example of the type of table I'm talking about).
If we store CRUD-type data in the event log, then we might have events:
ProductCreated
ProductUpdated
ProductDeleted (which would just mark it as deleted)
Then, do we attempt to find out what changed (in the ProductUpdated event), store just the change, and replay to get the latest image of the Product?
Mostly, I'm after what approach to use for CRUD tables - traditional or store in the event log? Additional information would be great!
Suppose you start purely with an event log, including for events like ProductCreated, etc., and no other data store. What happens then is that every time your application starts up, it has to replay all the events in the log to build its current state.
Now, suppose you create a traditional SQL table to store the current state of your app (say a products table) and the ID of the last event that was processed to get to that state (say a last_event table). What happens then is every time your app starts up, it has to replay only the events with higher IDs than the stored ID and process those to build its new state.
On the flip side, your app now has to be careful to keep these two states synchronised. If you need concurrency, you'll need to be careful to do atomic operations only on your SQL tables--but that should be reasonably easy with transactions.
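A rough sketch of that checkpointing idea, assuming some way to read the log from a given position (the EventLog interface, the event shape, and the in-memory maps standing in for the products and last_event tables are all invented for illustration):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ProductProjection {

    /** Minimal event shape: the id of the log entry plus the product data it carries. */
    record ProductEvent(long eventId, String type, String productId, String name) {}

    /** Whatever you use to read the log; only events after `afterEventId` are returned. */
    interface EventLog {
        List<ProductEvent> readAfter(long afterEventId);
    }

    // Stand-ins for the SQL tables: `products` and `last_event`.
    private final Map<String, String> products = new HashMap<>();
    private long lastProcessedEventId = 0;

    /** On startup: replay only the events we have not folded into current state yet. */
    public void catchUp(EventLog log) {
        for (ProductEvent event : log.readAfter(lastProcessedEventId)) {
            apply(event);
            lastProcessedEventId = event.eventId();
        }
        // With a real SQL store, updating `products` and `last_event` would happen
        // in one transaction so the two can never drift apart.
    }

    private void apply(ProductEvent event) {
        switch (event.type()) {
            case "ProductCreated", "ProductUpdated" -> products.put(event.productId(), event.name());
            case "ProductDeleted" -> products.remove(event.productId());
            default -> { /* ignore events this projection does not care about */ }
        }
    }
}
```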
Your support tables are just a read-model/projection of the event stream. In general you don't create those support models in case you need them. You create a read-model only if you use it somewhere in the UI.
Anyway, one important benefit of event sourcing is that you won't need to use joins in your queries. That is, you create a table for each read-model that contains all the data it needs - full denormalisation. You keep that table super-optimised for the query.

Text search for microservice architectures

I am looking into implementing text search on a microservice-based system. We will have to search for data that spans more than one microservice.
E.g. say we have two services for managing Organisations and managing Contacts. We should be able to search for organisations by contact details in one search operation.
Our preferred search solution is Elasticsearch. We already have a working solution based on embedded objects (and/or parent-child) where, when a parent domain object is updated, the indexing payload is enriched with the dependent object data, which is held in a cache (we avoid making calls to the service managing the child directly for this purpose).
I am wondering if there is a better solution. Is there a microservice pattern applicable to such scenarios?
It's not exactly a microservice pattern I would suggest, but it fits perfectly into microservices, and it's called event sourcing.
Event sourcing describes an architectural pattern in which events are generated by different sources. An event then triggers zero or more so-called projections, which use the data contained in the event to aggregate information in the form it is needed.
This is directly applicable to your problem: whenever the organisation service changes its internal state (an organisation is added / removed / updated), it can fire an event. If an organisation is added, a projection can, for example, aggregate the contacts for this organisation and store that aggregate. The search is now trivial: look up the organisation's id in the aggregated information (this can be indexed) and get back the contacts associated with this organisation. Of course the same works when contacts are added to the contact service: it just fires a message with the contact creation information, and the corresponding projections then alter different aggregates that can again be indexed and searched quickly.
You can have multiple projections responding to a single event, which enables you to aggregate information in many different forms - exactly the way you'd like to query it later. Don't be afraid of duplicated data: event sourcing takes this trade-off intentionally, and since this is not the data your business services rely on, and you do not need to alter it manually, this duplication will not hurt you.
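As a sketch of what such a projection could look like for the organisation/contact example (the event and document shapes are assumptions, and the in-memory map stands in for whatever index you actually use, Elasticsearch or otherwise):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One read-model: a denormalised "organisation + its contacts" document built purely from events. */
public class OrganisationSearchProjection {

    record OrganisationAdded(String orgId, String name) {}
    record ContactAdded(String orgId, String contactId, String email) {}

    /** The searchable aggregate; in practice this is what you would index. */
    record OrgSearchDocument(String orgId, String name, List<String> contactEmails) {}

    private final Map<String, OrgSearchDocument> documents = new HashMap<>();

    public void on(OrganisationAdded event) {
        documents.put(event.orgId(), new OrgSearchDocument(event.orgId(), event.name(), new ArrayList<>()));
    }

    public void on(ContactAdded event) {
        OrgSearchDocument doc = documents.get(event.orgId());
        if (doc != null) {
            doc.contactEmails().add(event.email());
        }
    }

    /** "Search organisations by contact details" becomes a lookup over the aggregated documents. */
    public List<OrgSearchDocument> findByContactEmail(String email) {
        List<OrgSearchDocument> hits = new ArrayList<>();
        for (OrgSearchDocument doc : documents.values()) {
            if (doc.contactEmails().contains(email)) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```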
If you store the events in the chronological order they happened (which I seriously advise you to do!), you can 'replay' these events over and over again. This helps, for example, if a projection was buggy and has to be fixed!
If you're interested, I suggest you read up on event sourcing and look for some kind of event store:
event sourcing
event store
We use event sourcing to aggregate an array of different searches in our system, and we aggregate millions of records every day into MongoDB. All projections have their own collection and create their own indexes, and until now we have never had to resort to different systems/patterns like Elasticsearch or the like!
Let me know if this helped!
Amendment
use the data contained in the event to aggregate information in the form it is needed
An event should contain all the information necessary to aggregate more information. For example, if you have an organisation creation event, you need to at least provide some information on what the organisation's name is, an ID of some kind, the creation date, the parent organisation's ID, etc. As a rule of thumb, we send all the information we gather in the service that receives the request (don't take it directly from the request ;-) check it first, then write it to the event and send it off), because we do not know what we're going to need in the future. Just stay cautious - payloads should not get too large!
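For instance, an organisation-creation event along those lines might carry something like this (the field names are invented; the point is only that the event is self-contained):

```java
import java.time.Instant;

/**
 * Everything downstream projections might need travels with the event itself.
 * Values are validated by the receiving service before being written, not copied straight off the request.
 */
record OrganisationCreated(
        String organisationId,
        String name,
        String parentOrganisationId,   // null/empty for a root organisation
        Instant createdAt,
        String createdByUserId) {
}
```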
We can now have multiple projections responding to this event: one that adds the organisation to its parent's aggregate (to get an easy lookup of all children of a given organisation), one that just adds it to the search set of all organisations, and maybe a third that aggregates all the parents of a given child organisation so the lookup of parent organisations is easy and fast.
We have the same service that handles client requests also process these events. The motivation behind this is that the schema of the data your projections create is tightly coupled to the way it is read by the service that the client interacts with. It does not have to be that way, and it could be separated into two services - but you create an almost invisible dependency there, and releasing these two services independently becomes even more challenging. But if you do not mind that additional level of complexity, you can separate the two.
We're currently also considering writing a generic service for aggregating information from events for things like searches, where projections could be scripted. That only makes the invisible-dependency problem less conspicuous; it does not solve it.

How to design an event receiver which is capable of showing the recent events

The home page of Meetup shows information on recent meetups on the right-hand side of the page. What kind of design patterns / tools (preferably Java-based) would you use to implement such an output?
There are a couple of different approaches; which one you use will depend on several factors, including the complexity of the business processes, the degree of flexibility desired, and load.
Simple Solution
"RSVP Updates" are written directly to some data source during the "RSVP" process; this process is essentially hard-coded.
Have something that reads the RSVPs directly from whatever data source / table they live in.
This solution will be fine as long as the load and data volumes are not excessive. The key point is that the RSVP UI widget ends up pulling the data out of the same data source the updates are written to.
Performance
A few different options, based on the above as a starting point:
Hold the data twice: once in the "master" (transactional) table of RSVP data, and once in a table built for servicing the UI (basically OLTP vs OLAP). The second table would include all the relevant data so that there are no look-ups to other tables, and as it's an independent copy of the data you can manage it differently if you want to (for example, purge old records so the table size is kept small).
Or, instead of a second table, just keep all the data in memory. This would require you to pull the data out of the main transactional table whenever the in-memory copy is lost.
Flexibility
Same as the original approach, but instead of hard-coding the step that records the RSVP (into a single data source), use a more loosely-coupled approach so that you can add / change / remove as many event processors as you wish. One will write the RSVP data to the main RSVP data source, while a second will do the same/similar but aggregated, ready for the "Recent RSVPs" UI widget.
Dependency Injection will give you the flexibility - certainly if you deal with a single implementation of the event handler.
The Publish / Subscribe or Chain of Responsibility patterns might give you the basis of an approach.
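A bare-bones publish/subscribe version of that in Java (the bus and handler code here is invented for illustration; in practice it could just as well be a messaging library or broker):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

public class RsvpPipeline {

    record RsvpEvent(String memberName, String meetupName) {}

    /** Trivial in-process publish/subscribe bus. */
    static class EventBus {
        private final List<Consumer<RsvpEvent>> subscribers = new ArrayList<>();
        void subscribe(Consumer<RsvpEvent> handler) { subscribers.add(handler); }
        void publish(RsvpEvent event) { subscribers.forEach(handler -> handler.accept(event)); }
    }

    public static void main(String[] args) {
        EventBus bus = new EventBus();

        // Processor 1: write to the master/transactional RSVP store (stubbed with a print).
        bus.subscribe(event -> System.out.println("persisting RSVP: " + event));

        // Processor 2: keep a small, bounded "Recent RSVPs" feed for the home-page widget.
        Deque<RsvpEvent> recent = new ArrayDeque<>();
        bus.subscribe(event -> {
            recent.addFirst(event);
            if (recent.size() > 10) {
                recent.removeLast();
            }
        });

        // Adding a third processor later (analytics, notifications, ...) is one more subscribe call.
        bus.publish(new RsvpEvent("Alice", "Java User Group"));
        System.out.println("widget shows: " + recent);
    }
}
```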
Is that the kind of info you were after?

Thread-safe unique entity instance in Core Data

I have a Message entity that has a messageID property. I'd like to ensure that there's only ever one instance of a Message entity with a given messageID. In SQL, I'd just add a unique constraint to the messageID column, but I don't know how to do this with Core Data. I don't believe it can be done in the data model itself, so how do you go about it?
My initial thought is to use a validation method to do a fetch on the NSManagedObject's context for the ID, see if it finds anything but itself, and if so, fails the validation. I suspect this will work - but I'm worried about the performance of something like that. I went through a lot of effort to minimize the fetch requests needed for the entire import routine, and having it validate by performing a fetch for every single new message entity seems a bit excessive. I can get all pre-existing objects I need and identify all the new objects I need to insert into the store using just two fetch queries before I do the actual work of importing and connecting everything together. This would add a fetch to every single update or insert in addition to those two - which would seem to eliminate any performance advantage I had by pre-processing the import data in the first place!
The main reason this is an issue is that the importer can (potentially) run several batches concurrently on several threads and may include some overlapping/duplicate data that needs to ultimately result in just one object in the store and not duplicate entries. Is there a reasonable way to do this and does what I'm asking for make sense for Core Data?
The only way to guarantee uniqueness is to do a fetch. Fortunately you can just do a -countForFetchRequest:error: and check to see if it is zero or not. That is the least expensive way to guarantee uniqueness at this time.
You can probably accomplish this in the validation or run it in the loop that is processing the data. Personally I would do it above the creation of the NSManagedObject so that you do not have the unnecessary allocs when a record already exists.
I don't think there is a way to easily guarantee an attribute is unique without doing a lot of work on your own. You can, of course use CFUUIDCreate to create a globally unique UUID, which should be unique, even in a multithreaded environment. But...
The objectID (type NSManagedObjectID) of all managed objects is guaranteed to be unique within the persistent store coordinator. Since you can add arbitrarily many persistent stores to the coordinator, this guarantee basically guarantees that the objectIDs are globally unique. Why don't you use the objectID as your messageID? You can't, of course, change the objectID once it's assigned (and it won't get assigned until the context containing the inserted object is saved; until then it will be a temporary but still unique ID).
So you have an NSManagedObjectContext for each thread, backed by the same persistent store, is that correct? And before you save the NSManagedObjectContext, you'd like to make sure the messageID is unique, that is, that you are not updating an existing row and that it is not in one of the other contexts, correct?
Given that model (correct me if I misunderstand), I think you'd be better served having one object that manages access to the persistent store. That way, all threads would update one context and you can do your validation in there, using Marcus's -countForFetchRequest:error: suggestion. Granted, that places a bottleneck on this operation.
Just to add my 2 cents: I think inconsistencies will occur sooner or later anyway, and the only way to mitigate them seems to be to do it at the application level with rather complex code.
So in my case I decided to allow duplicate values for what are supposed to be "unique" fields.
I added code, however, that detects these problems later (e.g. when a fetch that should return 1 object returns more than 1) and fixes them when they occur (usually by deleting).
It's a "go ahead, make a mistake, ill fix it later for you"-strategy.
This is not ideal, of course, but a valid way to attack this problen, imho.

Resources