How to remove projections in event sourcing? - event-sourcing

I'm working on an event-sourced system. In that system, data is divided into multiple streams and stored in an event store. As events are processed, different types of projections are built. At the moment the event store does not have any replay logic implemented, and I'm about to add it. However, the requirement is only to replay a given stream up to a certain point in time. To do that, I need to delete all the projections that belong to that stream.

All the projections are stored in a single SQL database. Even though the data is separated into individual streams, the table structures for these projections were not designed with any clear boundary that reflects those streams. So to delete a projection, I need to run a set of complex SQL statements. I feel like this might not be the best solution to the problem.

I'm wondering whether there is an alternative way to achieve the same goal, or whether this is usual in event sourcing. I'm also considering redesigning my projections. If so, is there any advice or are there best practices I should follow? I appreciate your help.
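For what it's worth, one common way to make this kind of replay manageable is to tag every projection row with the stream it came from, so "delete everything for stream X" becomes one simple statement per table. Below is a minimal TypeScript sketch under that assumption; the `EventStore` and `ProjectionDb` interfaces are hypothetical placeholders, not an existing library API.

```typescript
// Minimal sketch: projections keyed by stream_id so a single stream can be
// wiped and rebuilt. The eventStore/db interfaces are hypothetical.
interface StoredEvent {
  streamId: string;
  type: string;
  data: unknown;
  timestamp: Date;
}

interface EventStore {
  // Reads all events of one stream, oldest first.
  readStream(streamId: string): Promise<StoredEvent[]>;
}

interface ProjectionDb {
  // Every projection table carries a stream_id column, so per-stream deletes
  // stay simple (e.g. DELETE FROM order_summary WHERE stream_id = $1).
  deleteByStream(table: string, streamId: string): Promise<void>;
  apply(event: StoredEvent): Promise<void>;
}

async function replayStreamUpTo(
  store: EventStore,
  db: ProjectionDb,
  streamId: string,
  upTo: Date,
  tables: string[],
): Promise<void> {
  // 1. Remove everything this stream contributed to the read model.
  for (const table of tables) {
    await db.deleteByStream(table, streamId);
  }
  // 2. Re-apply the stream's events up to the requested point in time.
  const events = await store.readStream(streamId);
  for (const event of events) {
    if (event.timestamp > upTo) break;
    await db.apply(event);
  }
}
```

The key design choice is that the stream boundary is made explicit in the read-model schema, so rebuilding one stream never requires reasoning about which rows belong to which stream after the fact.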

Related

Documenting the aggregates & relations between microservices

In my organization we're trying to design our microservices based on the Bounded Context (BC) pattern (part of Domain-driven design). While we're doing this we also try to use another DDD pattern called the Context Mapping, to better identify the various contexts in the application, their boundaries and the relations between them.
All of this can be done on a whiteboard or in some online drawing tool. However, I'm looking for a way to generate a complete picture of the various services, what aggregates they contain, and potentially the relations between such aggregates (as the same User in one BC might be a Customer in another). A good example is figure 4-10 here. The generation should ideally be based on some DSL or script which we would maintain, as this kind of work is fairly high-level and context boundaries don't change very often. For example, when a team adds a new aggregate or starts keeping a copy of an aggregate from another service, they update the script/DSL and regenerate the diagram.
Solutions I've looked at so far:
Context mapper - it doesn't visualise the aggregates in each BC/service, nor does it show relations
C4 model, Level 2 - we already use it, so it could be fairly easy to add a textual list of aggregates per container, but it's not what it's intended for (and the visualisation is not optimal)
ddd bounded context/microservice canvas - it's too detailed and can't really be used to look at the big picture
I'm wondering how and whether this is done in other organizations, and I'm looking for suggestions for tooling that would help.
I think the format used for event storming sessions might be worth having a look at in your case. Once done, it covers all domain events, commands, actors, read models, policies, and external systems. It also illustrates the bounded contexts in which the aggregates, events, etc. live. An example can be found here:
https://medium.com/capital-one-tech/event-storming-decomposing-the-monolith-to-kick-start-your-microservice-architecture-acb8695a6e61
I know the format is mainly used for domain exploration, but in my experience, if done nicely (e.g. using a tool like Miro, Lucid, or the like) it also provides good documentation and an overview of what's going on in your system.
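If the script/DSL idea from the question still appeals, a very small model-to-diagram generator is easy to sketch. The TypeScript below is only an illustration of the approach, not an existing tool: it describes contexts, aggregates and cross-context relations in code and emits PlantUML text, and every name in it is invented for the example.

```typescript
// Illustrative sketch: describe bounded contexts and aggregates in code,
// then emit PlantUML that can be rendered into a diagram.
interface BoundedContext {
  name: string;
  aggregates: string[];
}

interface Relation {
  from: { context: string; aggregate: string };
  to: { context: string; aggregate: string };
  label?: string; // e.g. "copies", "known as"
}

function toPlantUml(contexts: BoundedContext[], relations: Relation[]): string {
  const lines: string[] = ["@startuml"];
  for (const ctx of contexts) {
    lines.push(`package "${ctx.name}" {`);
    for (const agg of ctx.aggregates) {
      lines.push(`  class "${agg}" as ${ctx.name}_${agg}`);
    }
    lines.push("}");
  }
  for (const rel of relations) {
    const from = `${rel.from.context}_${rel.from.aggregate}`;
    const to = `${rel.to.context}_${rel.to.aggregate}`;
    lines.push(`${from} --> ${to}${rel.label ? ` : ${rel.label}` : ""}`);
  }
  lines.push("@enduml");
  return lines.join("\n");
}

// Example: a User in Sales shows up as a Customer in Billing.
const diagram = toPlantUml(
  [
    { name: "Sales", aggregates: ["User", "Order"] },
    { name: "Billing", aggregates: ["Customer", "Invoice"] },
  ],
  [
    {
      from: { context: "Sales", aggregate: "User" },
      to: { context: "Billing", aggregate: "Customer" },
      label: "known as",
    },
  ],
);
console.log(diagram);
```

Because the model lives in version control next to the services, regenerating the picture after a boundary change is just rerunning the script.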

Guidance on Patterns and recommendations on achieving database Atomicity in distributed architecture (microservices)

Folks, I am evaluating options, patterns and practices around a key challenge we are facing in a distributed (microservices) architecture: maintaining DB atomicity across multiple tables.
Atomicity, reliability and scale are all critical for the business (that is probably common across businesses, just putting it out there).
I have read a few articles about achieving this, but it all comes at a significant cost and not without certain trade-offs, which I am not ready to make.
I have also read a couple of SO questions, and one concept, SAGA, seems interesting, but I don't think our legacy database is meant to handle it.
So here I am, asking experts for their personal opinions, guidance and past experience, so I can save time and effort without having to try and learn a bunch of options myself.
Appreciate your time and effort.
CAP theorem
The CAP theorem is key when it comes to distributed systems. Start with it to decide whether you want availability or consistency.
Distributed transactions
You are right: there are trade-offs involved and there is no single right answer, and distributed transactions are no different. In a microservices architecture, atomicity is not easy to achieve. Normally we design microservices with eventual consistency in mind; strong consistency is very hard and not a simple solution.
SAGA vs 2PC
2PC: it is very easy to achieve atomicity using two-phase commit, but that option is not a good fit for microservices. Your system can't scale, since if any of the microservices goes down your transaction will hang in an abnormal state, and locks are very common with this approach.
SAGA is the most accepted and scalable approach. You commit a local transaction (atomically); once that is done, you publish an event, and all interested services consume the event and update their own local databases. If there is an exception, or a particular microservice can't accept the event data, it raises a compensating transaction, which means you have to reverse and undo the actions taken by the other microservices against that event. This is a widely accepted pattern and it scales.
I don't get the legacy DB part. What makes you think the legacy DB will be a problem? SAGA has nothing to do with legacy systems. It simply means deciding whether you can accept the event or not: if yes, save it to the database; if not, raise a compensating transaction so all the other services can undo their work.
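To make that flow concrete, here is a minimal choreography-style sketch in TypeScript. The event bus, handlers and event shapes are all hypothetical; it only illustrates "commit locally, publish an event, compensate on failure", not any particular framework.

```typescript
// Hypothetical sketch of a choreography-based saga: each service commits
// locally, publishes an event, and on failure publishes a compensation
// event so the others can undo their work.
type OrderPlaced = { kind: "OrderPlaced"; orderId: string; amount: number };
type PaymentFailed = { kind: "PaymentFailed"; orderId: string; reason: string };
type SagaEvent = OrderPlaced | PaymentFailed;

interface EventBus {
  publish(event: SagaEvent): Promise<void>;
  subscribe(kind: SagaEvent["kind"], handler: (e: SagaEvent) => Promise<void>): void;
}

// Payment service: consumes OrderPlaced, updates its own DB, or compensates.
function registerPaymentService(
  bus: EventBus,
  chargeCard: (orderId: string, amount: number) => Promise<void>,
) {
  bus.subscribe("OrderPlaced", async (e) => {
    const order = e as OrderPlaced;
    try {
      await chargeCard(order.orderId, order.amount); // local transaction
    } catch (err) {
      // Can't accept the event: raise a compensating event so the order
      // service (and anyone else involved) can undo its local changes.
      await bus.publish({ kind: "PaymentFailed", orderId: order.orderId, reason: String(err) });
    }
  });
}

// Order service: listens for the compensation and reverses its local commit.
function registerOrderCompensation(
  bus: EventBus,
  cancelOrder: (orderId: string) => Promise<void>,
) {
  bus.subscribe("PaymentFailed", async (e) => {
    const failed = e as PaymentFailed;
    await cancelOrder(failed.orderId); // undo the earlier local transaction
  });
}
```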
What's the right approach?
Well, eventually it really depends on you. There are many patterns around when it comes to saving the transaction. Have a look at the CQRS and event sourcing patterns, which are used to save all the domain events. Distributed transactions can be complex, and CQRS solves many of the related problems, e.g. eventual consistency.
Hope that helps! Shoot me questions if you have any.
One possible option is Command Query Responsibility Segregation (CQRS) - maintain one or more materialized views that contain data from multiple services. The views are kept up to date by services that subscribe to the events that each service publishes when it updates its data. For example, the online store could implement a query that finds customers in a particular region and their recent orders by maintaining a view that joins customers and orders. The view is updated by a service that subscribes to customer and order events.
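As a rough illustration of that view-maintenance idea (the event shapes, the in-memory map standing in for a view table, and all names are invented for the example), a small subscriber might look like this:

```typescript
// Sketch of a read-model service that keeps a denormalised
// "customer + recent orders" view by subscribing to events from the
// customer and order services.
type CustomerRegistered = { kind: "CustomerRegistered"; customerId: string; region: string; name: string };
type OrderPlaced = { kind: "OrderPlaced"; orderId: string; customerId: string; placedAt: string };
type DomainEvent = CustomerRegistered | OrderPlaced;

interface CustomerOrdersRow {
  customerId: string;
  region: string;
  name: string;
  recentOrders: { orderId: string; placedAt: string }[];
}

const view = new Map<string, CustomerOrdersRow>(); // stands in for a real view table

function applyEvent(event: DomainEvent): void {
  switch (event.kind) {
    case "CustomerRegistered":
      view.set(event.customerId, {
        customerId: event.customerId,
        region: event.region,
        name: event.name,
        recentOrders: [],
      });
      break;
    case "OrderPlaced": {
      const row = view.get(event.customerId);
      if (row) {
        row.recentOrders.unshift({ orderId: event.orderId, placedAt: event.placedAt });
        row.recentOrders = row.recentOrders.slice(0, 10); // keep only recent orders
      }
      break;
    }
  }
}

// Query side: customers in a region together with their recent orders.
function customersInRegion(region: string): CustomerOrdersRow[] {
  return [...view.values()].filter((row) => row.region === region);
}
```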

How to approach event sourcing with millions of records

We're looking at implementing event sourcing / CQRS and for 95% of our system I can reason about the events and it doesn't scare me.
On the other hand, we also have a requirement where customers can insert data for millions of records in one go. A large portion of them can then be updated in one go as the records move location, etc., or have batch-level details updated. It also needs to be reversible if the customer changes their mind moments later.
Each record relates to a physical entity in the real world and it's important that the read model is updated quickly and the audit trail preserved at all costs for each record.
I can't seem to find any advice on how to handle these volumes. Are you supposed to write an event for every single record and action and just accept that it's going to be computationally / Database expensive? Are there any case studies that have similar requirements?
Any guidance is appreciated.
Are you supposed to write an event for every single record and action and just accept that it's going to be computationally / Database expensive?
A potentially helpful heuristic -- how would you do it with a version control system? Would you create an empty document, and then introduce a million commits, or would you have a single Data Imported commit, and go from there?
An important consideration to notice is that the authority for the data is somewhere else. "Physical entities in the real world" are not subject to the rules of your domain model; what you have there is a big pile of reference data.
It can help to think in processes -- what you have is an "import reference data" process that has a relatively small number of immediate steps, and, independently, some "do interesting things with each record" process, which may turn out to be millions of little processes each with a small number of events.
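One way to read that advice (my own sketch under assumed event shapes, not something prescribed by the answer): record the bulk operations as a handful of coarse-grained events that reference the imported batch, and let the read model derive per-record rows from the batch plus the referenced source data.

```typescript
// Sketch: coarse-grained events for bulk operations. Instead of one event per
// record (millions of them), the import, the batch update and the reversal are
// each a single event that references the affected batch. Event shapes and the
// loadBatchRows helper are invented for illustration.
type BatchImported = { kind: "BatchImported"; batchId: string; recordCount: number; source: string };
type BatchRelocated = { kind: "BatchRelocated"; batchId: string; newLocation: string };
type BatchImportReversed = { kind: "BatchImportReversed"; batchId: string; reason: string };
type BulkEvent = BatchImported | BatchRelocated | BatchImportReversed;

// The read model is still maintained per record, derived from the batch event
// plus the referenced source data, so the audit trail is preserved at batch
// granularity without millions of events in the store.
async function project(
  event: BulkEvent,
  loadBatchRows: (source: string) => Promise<{ id: string }[]>,
): Promise<void> {
  switch (event.kind) {
    case "BatchImported": {
      const rows = await loadBatchRows(event.source);
      console.log(`inserting ${rows.length} read-model rows for batch ${event.batchId}`);
      break;
    }
    case "BatchRelocated":
      console.log(`updating location to ${event.newLocation} for batch ${event.batchId}`);
      break;
    case "BatchImportReversed":
      console.log(`removing or flagging read-model rows for batch ${event.batchId}: ${event.reason}`);
      break;
  }
}
```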

MongoDB / Mongoose Schema for Threaded Messages (Efficiently)

I'm somewhat new to noSQL databases (I'm fairly good with relational databases though), and I'm wondering what the most efficient way to handle an inbox system with threaded messages would be.
Each 'message' will have a single sender and recipient. The number of received / sent messages will vary widely between users. This system should scale well to 1,000+ users.
I've read up on fan out on write / read but I'm not sure how well this would work for threaded messages.
Since I'm new to MongoDB / NoSQL in general, I'm not really used to structuring data efficiently this way.
I'm guessing there's going to be nested objects in any sort of efficient way of handling this...but I can't settle on a design that seems both efficient and convenient for threaded conversations between 2 users.
I thought of storing data with an array of the 2 users, combined with an array of 'message' objects. But then there's the issue of the order of the 2 user's usernames. (ex. [UserA, UserB] and [UserB, UserA] are both possible and would be problematic, so that seemed like a bad idea).
I thought of doing the whole fan out on read / write thing, but that doesn't seem efficient for threaded messages (since if grabbing messages by recipient is convenient, grabbing messages by sender won't be and vice versa).
I'm leaning towards favoring grabbing messages by recipient (since the inbox loads multiple messages, and sending only involves one [albeit with a longer look-up time]). But I'd really like to grab a threaded conversation in one go, as well as the list of users that a user has threaded conversations with (for the list of threads).
If someone could give me an efficient schema for threaded conversations I'd be very grateful. I've been researching this and trying to settle on a design for hours, and I'm exhausted. I keep finding flaws in my designs and scrapping them and I'd really just like some input from someone more experienced with NoSQL databases / MongoDB so I can avoid making a huge design flaw and/or writing logic that could've been handled with a better database design.
Thanks in advance for any and all help.
On this particular topic you are in luck, there is a great post discussing the various approaches to the schema here (it's a slight twist on what you are looking at, but not much different):
http://blog.mongodb.org/post/65612078649/schema-design-for-social-inboxes-in-mongodb
Then, this topic was also covered in detail at MongoDB World 2014 in three parts by Darren Wood and Asya Kamsky:
Part 1 Outline and Video
Part 2 Outline and Video
Part 3 Outline and Video
Also at MongoDB World the guys at Dropbox talked about the lessons they learned when building their Mailbox:
http://www.mongodb.com/presentations/mongodb-mailbox
And then, to round it off, there is a full reference architecture with code called Socialite on Github written by the aforementioned Darren Wood:
https://github.com/10gen-labs/socialite
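If it helps to see one possible shape in code, here is a minimal Mongoose sketch of a common design for two-person threads: a conversation document with a canonically sorted participant pair (which sidesteps the [UserA, UserB] vs [UserB, UserA] ordering problem from the question), and messages in their own collection referencing the conversation. The collection and field names are my own, not taken from the linked material.

```typescript
import mongoose, { Schema, Types } from "mongoose";

// Conversation: one document per pair of users. Sorting the participant ids
// gives a canonical key, so both orderings map to the same thread and can be
// covered by a single unique index.
const conversationSchema = new Schema({
  participants: { type: [Schema.Types.ObjectId], required: true }, // always stored sorted
  participantsKey: { type: String, required: true, unique: true }, // "idA:idB", sorted
  lastMessageAt: { type: Date, default: Date.now },
});
conversationSchema.index({ participants: 1, lastMessageAt: -1 }); // "list my threads"

// Message: one document per message, referencing its conversation.
const messageSchema = new Schema({
  conversation: { type: Schema.Types.ObjectId, ref: "Conversation", required: true },
  sender: { type: Schema.Types.ObjectId, ref: "User", required: true },
  recipient: { type: Schema.Types.ObjectId, ref: "User", required: true },
  body: { type: String, required: true },
  sentAt: { type: Date, default: Date.now },
});
messageSchema.index({ conversation: 1, sentAt: -1 }); // "load a thread"
messageSchema.index({ recipient: 1, sentAt: -1 });    // "load my inbox"

export const Conversation = mongoose.model("Conversation", conversationSchema);
export const Message = mongoose.model("Message", messageSchema);

// Helper: canonical key for a pair of user ids.
export function pairKey(a: Types.ObjectId, b: Types.ObjectId): string {
  return [a.toString(), b.toString()].sort().join(":");
}
```

Loading an inbox is then an indexed query on recipient, loading a thread is an indexed query on conversation, and the thread list comes from the conversations collection sorted by lastMessageAt.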

schema-less data warehouse and reporting

We have a system that generates many events as the result of a phone call/web request/SMS/email etc. Each of these events needs to be stored and made available for reporting (for MI/BI etc.), and each event has many variables and does not fit any one specific schema.
The structure of the event document is a key-value pair list (cdr=1&name=Paul&duration=123&postcode=l21). Currently we have a SQL Server system using dynamically generated sparse columns to store our (flat) documents, with reports that run against that data. For many different reasons I am looking at other solutions.
I am looking for suggestions for a system (open or closed) that allows us to push these events in (regardless of the schema) and provides reporting and analytics on top of it.
I have seen Pentaho and Jasper, but most of them seem to connect to a system to get the data out of it and then report on it. I really just want to be able to push a document in and have it available to be reported on.
As much as I love CouchDB, I am looking for a system that allows schema-less submission of data and reporting on top of it (much like Pentaho, Jasper, SQL Reporting/Analytics Server, etc.).
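For concreteness, an event document in that key-value format flattens into a plain record; this is just a minimal sketch of the data shape, using the example string from the question:

```typescript
// Sketch: flatten a key-value event document (query-string style) into a
// plain record before loading it into whatever store ends up being used.
function parseEvent(doc: string): Record<string, string> {
  const record: Record<string, string> = {};
  for (const [key, value] of new URLSearchParams(doc)) {
    record[key] = value;
  }
  return record;
}

// Example from the question:
const event = parseEvent("cdr=1&name=Paul&duration=123&postcode=l21");
// => { cdr: "1", name: "Paul", duration: "123", postcode: "l21" }
console.log(event);
```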
I don't think there is any DBMS that will do what you want and allow an off-the-shelf reporting tool to be used. Low-latency analytic systems are not quick and easy to build. Low-latency on unstructured data is quite ambitious.
You are going to have to persist the data in some sort of database, though.
I think you may have to take a closer look at your problem domain. Are you trying to run low-latency analytical reports, or an operational report that prompts some action within the business when certain events occur? For low-latency systems you need to be quite ruthless about what constitutes operational reporting and what constitutes analytics.
Edit: Discourage the 'potentially both' mindset unless the business is prepared to pay. Investment banks and hedge funds spend big bucks and purchase supercomputers to do 'real-time analytics'. It's not a trivial undertaking, and it's even less trivial when you try to build such a system for high uptime.
Even on apps like premium-rate SMS services and .com applications the business often backs down when you do a realistic scope and cost analysis of the problem. I can't say this enough. Be really, really ruthless about 'realtime' requirements.
If the business really, really need realtime analytics then you can make hybrid OLAP architectures where you have a marching lead partition on the fact table. This is an architecture where the fact table or cube is fully indexed for historical data but has a small leading partition that is not indexed and thus relatively quick to insert data into.
Analytic queries will table scan the relatively small leading data partition and use more efficient methods on the other partitions. This gives you low latency data and the ability to run efficient analytic queries over the historical data.
Run a process nightly that rolls over to a new leading partition and consolidates/indexes the previous lead partition.
This works well where you have items such as bitmap indexes (on databases) or materialised aggregations (on cubes) that are expensive on inserts. The lead partition is relatively small and cheap to table scan but efficient to trickle insert into. The roll-over process incrementally consolidates this lead partition into the indexed historical data which allows it to be queried efficiently for reports.
Edit 2: The common fields might be candidates to set up as dimensions on a fact table (e.g. caller, time). The less common fields are (presumably) coding. For an efficient schema you could move the optional coding into one or more 'junk' dimensions.
Briefly, a junk dimension is one that represents every existing combination of two or more codes. A row on the table doesn't relate to a single system entity but to a unique combination of coding. Each row on the dimension table corresponds to a distinct combination that occurs in the raw data.
In order to have any analytic value you are still going to have to organise the data so that the columns in the junk dimension contain something consistently meaningful. This goes back to some requirements work to make sure that the mappings from the source data make sense. You can deal with items that are not always recorded by using a placeholder value such as a zero-length string (''), which is probably better than nulls.
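As a toy illustration of how such a junk dimension gets populated (the field names, the in-memory map, and the surrogate-key scheme are all invented; a real implementation would sit in the ETL layer):

```typescript
// Toy sketch: build a junk dimension from optional coding fields. Each
// distinct combination of codes becomes one dimension row; fact rows carry
// only the surrogate key of their combination.
type EventRecord = Record<string, string>;

const JUNK_FIELDS = ["postcode", "campaign", "channel"]; // hypothetical optional codes

interface JunkDimension {
  rows: Map<string, { key: number; values: Record<string, string> }>;
  nextKey: number;
}

function junkKeyFor(dim: JunkDimension, event: EventRecord): number {
  // Missing codes become '' (the placeholder value suggested above) so every
  // event maps to exactly one combination.
  const values: Record<string, string> = {};
  for (const field of JUNK_FIELDS) values[field] = event[field] ?? "";
  const combo = JUNK_FIELDS.map((f) => values[f]).join("|");

  let row = dim.rows.get(combo);
  if (!row) {
    row = { key: dim.nextKey++, values };
    dim.rows.set(combo, row); // new distinct combination -> new dimension row
  }
  return row.key;
}

// Usage: the fact row stores measures plus the junk key.
const dim: JunkDimension = { rows: new Map(), nextKey: 1 };
const factRow = {
  duration: 123,
  junkKey: junkKeyFor(dim, { cdr: "1", name: "Paul", duration: "123", postcode: "l21" }),
};
console.log(factRow);
```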
Now I think I see the underlying requirements. This is an online or phone survey application with custom surveys. The way to deal with this requirement is to fob the analytics off onto the client. No online tool will let you turn around schema changes in 20 minutes.
I've seen this type of requirement before and it boils down to the client wanting to do some stats on a particular survey. If you can give them a CSV based on the fields (i.e. with named header columns) in their particular survey, they can import it into Excel and pivot it from there.
This should be fairly easy to implement from a configurable online survey system as you should be able to read the survey configuration. The client will be happy that they can play with their numbers in Excel as they don't have to get their head around a third party tool. Any competent salescritter should be able to spin this to the client as a good thing. You can use a spiel along the lines of 'And you can use familiar tools like Excel to analyse your numbers'. (or SAS if they're that way inclined)
Wrap the exporter in a web page so they can download it themselves and get up-to-date data.
Note that the wheels will come off if you have larger data volumes over 65535 respondents per survey as this won't fit onto a spreadsheet tab. Excel 2007 increases this limit to 1048575. However, surveys with this volume of response will probably be in the minority. One possible workaround is to provide a means to get random samples of the data that are small enough to work with in Excel.
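A sketch of such an exporter, assuming the per-survey field list can be read from the survey configuration (all names here are invented): it writes a header row from the configured fields and leaves blanks for responses that didn't record a field.

```typescript
// Sketch: export one survey's responses as CSV, driven by that survey's
// configured field list. Names and shapes are illustrative only.
type Response = Record<string, string>;

function toCsv(fields: string[], responses: Response[]): string {
  const escape = (v: string) =>
    /[",\n]/.test(v) ? `"${v.replace(/"/g, '""')}"` : v;

  const header = fields.map(escape).join(",");
  const rows = responses.map((r) =>
    fields.map((f) => escape(r[f] ?? "")).join(","), // blank for unanswered fields
  );
  return [header, ...rows].join("\n");
}

// Example: the field list comes from the survey configuration.
const csv = toCsv(
  ["name", "duration", "postcode"],
  [
    { name: "Paul", duration: "123", postcode: "l21" },
    { name: "Jane", duration: "45" }, // postcode not recorded
  ],
);
console.log(csv);
// name,duration,postcode
// Paul,123,l21
// Jane,45,
```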
Edit: I don't think there are other solutions that are sufficiently flexible for this type of application. You've described a holy grail of survey statistics.
I still think that the basic strategy is to give them a data dump. You can pre-package it to some extent by using OLE automation to construct a pivot table and deliver something partially digested. The API for pivot tables in Excel is a bit hairy but this is certainly quite feasible. I have written VBA code that programmatically creates pivot tables in the past, so I can say from personal experience that this is feasible to do.
The problem becomes a bit more complex if you want to compute and report distributions of (say) response times, as you have to construct the displays. You can programmatically construct pivot charts if necessary, but automating report construction through Excel in this way will be a fair bit of work.
You might get some mileage from R (www.r-project.org) as you can construct a framework that lets you import data and generate bespoke reports with a bit of R Code. This is not an end-user tool but your client base sounds like they want canned reports anyway.
