How to approach event sourcing with millions of records

We're looking at implementing event sourcing / CQRS and for 95% of our system I can reason about the events and it doesn't scare me.
On the other hand, we also have a requirement where customers can insert data for millions of records in one go. A large portion of them can be updated in one go as they move location, etc., or have batch-level details updated. It also needs to be reversible if they change their mind moments after.
Each record relates to a physical entity in the real world and it's important that the read model is updated quickly and the audit trail preserved at all costs for each record.
I can't seem to find any advice on how to handle these volumes. Are you supposed to write an event for every single record and action and just accept that it's going to be computationally/database expensive? Are there any case studies that have similar requirements?
Any guidance is appreciated.

Are you supposed to write an event for every single record and action and just accept that it's going to be computationally/database expensive?
A potentially helpful heuristic -- how would you do it with a version control system? Would you create an empty document, and then introduce a million commits, or would you have a single Data Imported commit, and go from there?
An important consideration to notice is that the authority for the data is somewhere else. "Physical entities in the real world" are not subject to the rules of your domain model; what you have there is a big pile of reference data.
It can help to think in processes -- what you have is an "import reference data" process, which has a relatively small number of immediate steps, and, independently, a "do interesting things with each record" process, which may turn out to be millions of little processes, each with a small number of events.
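As a rough illustration of that split, the import itself can be recorded as one coarse-grained event that points at the imported data set, with per-record events only appearing later as individual records are acted on. A minimal sketch in Postgres-flavoured SQL (the table and column names here are hypothetical, not taken from the question):

```sql
-- Hypothetical event store: one stream per aggregate.
CREATE TABLE events (
    event_id    BIGSERIAL PRIMARY KEY,
    stream_id   VARCHAR(100) NOT NULL,   -- e.g. 'import-2017-03-01' or 'record-42'
    event_type  VARCHAR(100) NOT NULL,
    payload     JSONB        NOT NULL,
    occurred_at TIMESTAMP    NOT NULL DEFAULT now()
);

-- One coarse-grained event for the whole import; the raw rows live in a
-- separate reference table (or file) that the event merely points to.
INSERT INTO events (stream_id, event_type, payload)
VALUES ('import-2017-03-01', 'ReferenceDataImported',
        '{"source": "customer_upload_123.csv", "record_count": 2000000}');

-- Later, individual records get their own small streams as interesting
-- things happen to them.
INSERT INTO events (stream_id, event_type, payload)
VALUES ('record-42', 'RecordRelocated', '{"new_location": "site-B"}');
```

The audit trail for each physical record is then the handful of events in its own stream, plus the single import event it traces back to.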

Related

What is more important: speed (time) or space?

I had this question when I was given a task which made me think for a while, but I was not able to come up with an accurate or satisfying answer.
The task was something like this.
Things I already have:
a table "User" which contains details about users, like createdTime, type (agent, admin, other), id, etc. (This table contains a very large number of rows.)
The task given to me is:
create a new table which will keep track of users which are deleted.
Then join the "User" table with this newly created table and show the users which are not already deleted and are of type='agent'.
Now my question is
"why are they asking me to create a new table instead of creating a new column(in User table) which will store a flag(true If user is deleted, else not deleted)
"Is join not time consuming?"
(creating a new column in the current "User" table will help in keeping the detail intact with the User.But creating a new table - where this will help?)
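To make it concrete, the two options I am comparing look roughly like this (Postgres-style SQL; names simplified for illustration):

```sql
-- Option 1: a separate table tracking deletions, as the task asks for.
CREATE TABLE deleted_users (
    user_id    INT PRIMARY KEY REFERENCES "User"(id),
    deleted_at TIMESTAMP NOT NULL
);

-- Non-deleted agents via an anti-join.
SELECT u.*
FROM "User" u
LEFT JOIN deleted_users d ON d.user_id = u.id
WHERE d.user_id IS NULL
  AND u.type = 'agent';

-- Option 2: a flag column on the User table itself.
-- ALTER TABLE "User" ADD COLUMN is_deleted BOOLEAN;   -- empty for most rows
SELECT *
FROM "User"
WHERE (is_deleted IS NULL OR is_deleted = FALSE)
  AND type = 'agent';
```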
When I asked this question to my team member, he replied, "If you create a new column, the value can be empty for a lot of rows, and that wastes space" (what he said is right too).
This suggests he cares more about space and less about query time.
Nowadays we can get any amount of space with money, but speed is more important, right? If the speed is poor, then what is the use of anything which saves our disk space?
Shouldn't we care about time more and space less?
I would like to know what you think about this. What is more important to you, and why?
I know this question can be down-voted, but I wanted to know what most developers think in such a case. What do they care about more: space or time?
Thanks for your time.
This will be an answer, even though it's a personal take, because it's too big for a comment:
The two are rarely separate from each other but a function of each other.
Speed is also not just 'query time' - it's basically execution time including data handling, from request to reply.
So if you have something that hogs space, you have more I/O and more memory usage and need to spend more time accessing data and more time transmitting data.
So my "answer" would be:
Treat both as equal and minimize both.
"This suggests he cares more about space and less about query time":
No. That's a wrong assumption.
If you start thinking about normalization and de-normalization in the database when thinking about performance, you are almost always better off normalizing data.
This is not only to save space, but also to save on maintenance of data/data integrity (faster updates, less locking) and indexing (space, yes - but also speed), and then, when using the data, on I/O transfer from disk to server, on memory usage, and on transmission across networks. Also - space usage means memory usage, and the more space things take up, the more you also want to put into memory.
All these things lead to performance, aka speed.
The times you start thinking of de-normalization in the database, you usually do it in connection with pre-calculating results and queries and utilizing caching so you don't have to do the joins on demand. So while there are some situations where de-normalization is a plausible solution - more often than not, you're still better off normalizing the data.

Two processes saving a record in database at the same time

What happens when more than one user inserts data into a database (MySQL, Postgres) at exactly the same time? How does it prioritize which record is inserted first and which one later? If the answer is specific to the application or program, I am asking in reference to web applications.
In general, two things never happen at exactly the same time. There's a queue of work and at some level one thing always happens before the other.
However, there are cases where an overall transaction may take multiple steps -- and if two of these kinds of transactions begin at nearly the same time, they may overlap in time. This can cause problems.
For example, imagine a person buys something in a shopping cart and the steps include both creating an order record for them and decrementing an inventory count. If two people begin this process at nearly the same time, they could both potentially buy the item before the inventory is decremented to show the item out of stock.
In cases where things like this can occur, Postgres (and other modern databases) provide ways for programs to protect themselves. These include both transactions and locking.
With transactions (see the Postgres documentation), groups of statements are run as a single unit -- and if one of the later steps fails, all steps are 'rolled back'. (For example, if decrementing inventory isn't possible because the item is now out of stock, the order creation can be rolled back.)
With locking (see the Postgres documentation), tables (or even individual rows in a table) are locked so that any other process wanting to access them either waits or is timed out. This would prevent two processes from updating the same data at nearly the same time.
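For the shopping-cart example, a rough sketch of both mechanisms working together in Postgres-style SQL (table and column names are made up for illustration):

```sql
BEGIN;

-- Lock the inventory row so a concurrent checkout waits at this point.
SELECT quantity
FROM inventory
WHERE item_id = 42
FOR UPDATE;

-- The application checks that quantity > 0, then records the sale.
INSERT INTO orders (customer_id, item_id) VALUES (7, 42);

UPDATE inventory
SET quantity = quantity - 1
WHERE item_id = 42;

-- If any step fails (e.g. the item turned out to be out of stock),
-- ROLLBACK undoes both the order and the decrement; otherwise the
-- whole unit is committed together.
COMMIT;
```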
In general, the vast majority of applications don't require either of these approaches. Unless you're working in an environment such as at a bank where the tables involved contain financial transactions, you probably won't have to worry about it.
It's never exactly the same time. One will happen before the other.
Which one will, unless you implement your own prioritisation mechanism, is indeterminate, and you should never rely on it.
As to what will happen, well that depends.
For two inserts to the same table: if data integrity is dependent on the order in which they are executed, your database design has a horrendous flaw.
For collisions (two updates to the same record, for instance), there are two implementations.
Pessimistic locking. Assume there will be a significant number of updates to the same data, so issue a lock around it. If the lock exists, fail the update (e.g. the second one, if the first hasn't finished) with some suitable message.
Optimistic locking. Assume collisions will rarely happen. The usual way of doing this is to add a timestamp field to the record which changes on every update. So when you read the data you get the timestamp, and when you write the data you only do so if the timestamp you have matches the one that's there now, updating said timestamp as part of the write. If it does not match, you show the "someone else has changed this data" message.
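A minimal sketch of the optimistic approach (the table and column names are illustrative):

```sql
-- Read: remember the timestamp you were given.
SELECT id, name, address, last_updated
FROM customer
WHERE id = 42;

-- Write: only succeeds if the row still carries the timestamp you read.
UPDATE customer
SET address = '10 New Street',
    last_updated = CURRENT_TIMESTAMP
WHERE id = 42
  AND last_updated = TIMESTAMP '2011-05-01 09:30:00';  -- the value read earlier

-- If zero rows were affected, someone else changed the record first:
-- show the "someone else has changed this data" message and re-read.
```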
There is a compromise position, where you try to merge the two updates (for instance, you change the name and I change the address). You need to really think about that though; it's messy, gets very complicated very quickly, and getting it wrong runs a real risk of messing up the data.
People with far larger IQs than mine spend a lot of time on this stuff, personally I like to keep it like me, simple...

Calculating results in a scalable way based on transaction data in web app?

Possible duplicate:
Database design: Calculating the Account Balance
I work with a web app which stores transaction data (e.g. like "amount x on date y", but more complicated) and provides calculation results based on details of all relevant transactions[1]. We are investing a lot of time into ensuring that these calculations perform efficiently, as they are an interactive part of the application: i.e. a user clicks a button and waits to see the result. We are confident, that for the current levels of data, we can optimise the database fetching and calculation to complete in an acceptable amount of time. However, I am concerned that the time taken will still grow linearly as the number of transactions grow[2]. I'd like to be able to say that we could handle an order of magnitude more transactions without excessive performance degradation.
I am looking for effective techniques, technologies, patterns or algorithms which can improve the scalability of calculations based on transaction data.
There are however, real and significant constraints for any suggestion:
We currently have to support two highly incompatible database implementations, MySQL and Oracle. Thus, for example, using database-specific stored procedures has roughly twice the maintenance cost.
The actual transactions are more complex than the example transaction given, and the business logic involved in the calculation is complicated and regularly changing. Thus, having the calculations stored directly in SQL is not something we can easily maintain.
Any of the transactions previously saved can be modified at any time (e.g. the date of a transaction can be moved a year forward or back) and calculations are expected to be updated instantly. This has a knock-on effect for caching strategies.
Users can query across a large space, in several dimensions. To explain, consider being able to calculate a result as it would stand at any given date, for any particular transaction type, where transactions are filtered by several arbitrary conditions. This makes it difficult to pre-calculate the results a user would want to see.
One instance of our application is hosted on a client's corporate network, on their hardware. Thus we can't easily throw money at the problem in terms of CPUs and memory (even if those are actually the bottleneck).
I realise this is very open ended and general, however...
Are there any suggestions for achieving a scalable solution?
[1] Where 'relevant' can be: the date queried for; the type of transaction; the type of user; formula selection; etc.
[2] Admittedly, this is an improvement over the previous performance, where an ORM's n+1 problems saw the time taken grow either exponentially, or at least with a much steeper gradient.
I have worked against similar requirements and have some suggestions. A lot of this depends on what is possible with your data. It is difficult to make every case imaginable quick, but you can optimize for the common cases and have enough hardware grunt available for the others.
Summarise
We create summaries on a daily, weekly and monthly basis. For us, most of the transactions happen in the current day. Old transactions can also change. We keep a batch record and, under this, the individual transaction records. Each batch has a status to indicate whether the transaction summary (in table batch_summary) can be used. If an old transaction in a summarised batch changes, then as part of that transaction the batch is flagged to indicate that the summary is not to be trusted. A background job will re-calculate the summary later.
Our software then uses the summary when possible and falls back to the individual transactions where there is no summary.
We played around with Oracle's materialized views, but ended up rolling our own summary process.
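A rough sketch of that structure (the real schema isn't shown here, so these table and column names are assumptions based on the description above):

```sql
CREATE TABLE batch (
    batch_id      INT PRIMARY KEY,
    summary_valid BOOLEAN NOT NULL DEFAULT FALSE   -- the status flag described above
);

CREATE TABLE batch_summary (
    batch_id     INT PRIMARY KEY REFERENCES batch(batch_id),
    summary_date DATE NOT NULL,
    total_amount DECIMAL(18,2) NOT NULL,
    txn_count    INT NOT NULL
);

-- When an old transaction changes, flag its batch in the same transaction,
-- so readers fall back to the raw transaction rows until the background
-- job recalculates the summary and sets the flag back to TRUE.
UPDATE batch SET summary_valid = FALSE WHERE batch_id = 123;
```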
Limit the Requirements
Your requirements sound very wide. There can be a temptation to put all the query fields on a web page and let the users choose any combination of fields and output results. This makes it very difficult to optimize. I would suggest digging deeper into what they actually need to do, or have done in the past. It may not make sense to query on very unselective dimensions.
In our application, certain queries are limited to a date range of not more than 1 month. We have aligned some features to the date-based summaries, e.g. you can get results for the whole of Jan 2011, but not for 5-20 Jan 2011.
Provide User Interface Feedback for Slow Operations
On occasion we have found it difficult to optimize some things to take less than a few minutes. We ship a job off to a background server rather than have a very slow-loading web page. The user can fire off a request and go about their business while we get the answer.
I would suggest using Materialized Views. Materialized Views allow you to store a View as you would a table. Thus all of the complex queries you need to have done are pre-calculated before the user queries them.
The tricky part is of course updating the Materialized View when the tables it is based on change. There's a nice article about it here: Update materialized view when underlying tables change.
Materialized Views are not (yet) available without plugins in MySQL and are horribly complicated to implement otherwise. However, since you have Oracle I would suggest checking out the link above for how to add a Materialized View in Oracle.
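On the Oracle side, a basic materialized view could look something like the sketch below. The table, columns and refresh policy are illustrative; fast or on-commit refresh is also possible but requires materialized view logs and comes with extra restrictions:

```sql
CREATE MATERIALIZED VIEW transaction_daily_totals
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
AS
SELECT TRUNC(txn_date) AS txn_day,
       txn_type,
       SUM(amount)     AS total_amount,
       COUNT(*)        AS txn_count
FROM   transactions
GROUP  BY TRUNC(txn_date), txn_type;

-- Refreshed from a scheduled job, e.g. in SQL*Plus:
-- EXEC DBMS_MVIEW.REFRESH('TRANSACTION_DAILY_TOTALS');
```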

what is a good schema design for a loan origination system?

I am designing a loan origination system which would allow its users to create loans and draw up the repayment schedule of a loan depending on the loan product parameters. I should also be able to add penalties, fees etc. Rescheduling a loan should be a possibility. I also need a loan schedule to do fast reporting.
I have a loans table, loan product table, payment schedule table and loan history table etc. I am not able to understand how I can design this schema ahead of time to keep it from changing too much.
I am doing this in ruby, rails3 and datamapper.
Except in the most tightly specified applications, I'm not sure you can design a schema that won't change much. What you can do is to make schemas that are not brittle, schemas that allow for change to happen. For the most part, that means:
Include only the data you know you need to meet today's requirements
Normalize.
Write automated tests.
The first rule is akin to "do the simplest thing possible," or "you ain't gonna need it," the rule that programmers use to avoid code bloat. Smaller schemas, like smaller code bases, need less effort to change. The second (normalize) is analogous to the Don't Repeat Yourself (DRY) principle, also known as "once and only once," another rule used to make code cheaper to change. The third rule (tests) is how programmers make refactoring possible without worrying about breaking everything. By tests, I mean testing the code that uses the schema, but also testing the schema itself: triggers, rules, cascading deletes, &c. can be tested, and when tested, it is easier to change them in response to changing requirements.
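To make the first two rules concrete for the tables named in the question, a deliberately minimal, normalized starting point might look like this (the columns are placeholders for whatever today's requirements actually demand, not a recommendation):

```sql
CREATE TABLE loan_products (
    loan_product_id INT PRIMARY KEY,
    name            VARCHAR(100)  NOT NULL,
    interest_rate   DECIMAL(7,4)  NOT NULL,
    term_months     INT           NOT NULL
);

CREATE TABLE loans (
    loan_id         INT PRIMARY KEY,
    loan_product_id INT           NOT NULL REFERENCES loan_products(loan_product_id),
    principal       DECIMAL(14,2) NOT NULL,
    start_date      DATE          NOT NULL
);

CREATE TABLE payment_schedule (
    loan_id     INT           NOT NULL REFERENCES loans(loan_id),
    due_date    DATE          NOT NULL,
    amount_due  DECIMAL(14,2) NOT NULL,
    PRIMARY KEY (loan_id, due_date)
);

-- Penalties, fees and rescheduling history can be added as separate,
-- normalized tables when those requirements actually arrive.
```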
There are excuses, in the database world, for breaking these rules. The reason to break rule 1 (do the simplest thing/YAGNI) is that some data will be easier to collect from the beginning, and difficult or perhaps even impossible to collect if you decide you do need it later. Still, think twice before giving in to this excuse. You can almost always deal, without too much fuss, with gaps in the data caused by adding columns or tables later, but if you include data today that you might only need tomorrow, you will be paying for it every time you change the schema. Each bit of data you include that you end up not needing is nothing but cost with no benefit. Perhaps more significantly, extra data can have a terrible effect upon performance, since it reduces the number of records that can fit in memory. Even though databases go to great pains to give good performance when reading from disk, their best performance comes from having enough memory (or little enough data) so that all or most of the working set will fit in RAM.
The excuse for breaking rule 2 (normalization) is performance: "Data warehouse" applications sometimes require denormalization in cases where many-table joins make a database slow and cranky. I'd want to be certain it was needed before denormalizing, since it's not free: data that exists in more than one place makes the schema more difficult to change, and trades off speed of queries for more work when inserting and updating.
I don't know of an excuse for breaking rule 3 (testing), or at least not a good one, but that doesn't mean there isn't one.
Martin Fowler writes "Evolutionary Database Design". Scott Ambler and Pramod Sadalage have a book on Refactoring Databases. See also a summary/cheat sheet of the book's refactorings.

schema-less data warehouse and reporting

We have a system that generates many events as the result of a phone call/web request/SMS/email etc. Each of these events needs to be stored and be available for reporting (for MI/BI etc), each of these events has many variables, and they do not fit any one specific schema.
The structure of the event document is a key-value pair list (cdr= 1&name=Paul&duration=123&postcode=l21). Currently we have a SQL Server system using dynamically generated sparse columns to store our (flat) document, and we have reports that run against the data; for many different reasons I am looking at other solutions.
I am looking for suggestions for a system (open or closed) that allows us to push these events in (regardless of the schema) and provides reporting and analytics on top of it.
I have seen Pentaho and Jasper, but most of them seem to connect to a system to get the data out of it and then report on it. I really just want to be able to push a document in and have it available to be reported on.
As much as I love CouchDB, I am looking for a system that allows schema-less submitting of data and reporting on top of it (much like Pentaho, Jasper, SQL Reporting/Analytics Server etc)
I don't think there is any DBMS that will do what you want and allow an off-the-shelf reporting tool to be used. Low-latency analytic systems are not quick and easy to build. Low-latency on unstructured data is quite ambitious.
You are going to have to persist the data in some sort of database, though.
I think you may have to take a closer look at your problem domain. Are you trying to run low-latency analytical reports, or an operational report that prompts some action within the business when certain events occur? For low-latency systems you need to be quite ruthless about what constitutes operational reporting and what constitutes analytics.
Edit: Discourage the 'potentially both' mindset unless the business is prepared to pay. Investment banks and hedge funds spend big bucks and purchase supercomputers to do 'real-time analytics'. It's not a trivial undertaking. It's even less trivial when you try to build such a system for high uptime.
Even on apps like premium-rate SMS services and .com applications the business often backs down when you do a realistic scope and cost analysis of the problem. I can't say this enough. Be really, really ruthless about 'realtime' requirements.
If the business really, really need realtime analytics then you can make hybrid OLAP architectures where you have a marching lead partition on the fact table. This is an architecture where the fact table or cube is fully indexed for historical data but has a small leading partition that is not indexed and thus relatively quick to insert data into.
Analytic queries will table scan the relatively small leading data partition and use more efficient methods on the other partitions. This gives you low latency data and the ability to run efficient analytic queries over the historical data.
Run a process nightly that rolls over to a new leading partition and consolidates/indexes the previous lead partition.
This works well where you have items such as bitmap indexes (on databases) or materialised aggregations (on cubes) that are expensive on inserts. The lead partition is relatively small and cheap to table scan but efficient to trickle insert into. The roll-over process incrementally consolidates this lead partition into the indexed historical data which allows it to be queried efficiently for reports.
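In Oracle terms, a sketch of that layout using range partitioning might look like the following (partition names, boundaries and columns are illustrative, not a tested design):

```sql
CREATE TABLE event_fact (
    event_time DATE NOT NULL,
    caller_id  NUMBER,
    duration   NUMBER
)
PARTITION BY RANGE (event_time) (
    PARTITION p_2011_q1 VALUES LESS THAN (DATE '2011-04-01'),
    PARTITION p_2011_q2 VALUES LESS THAN (DATE '2011-07-01'),
    PARTITION p_lead    VALUES LESS THAN (MAXVALUE)   -- small leading partition
);

-- Historical partitions carry local bitmap indexes; the lead index
-- partition can be marked UNUSABLE to keep trickle inserts cheap, so
-- queries simply table-scan the small lead partition instead.
CREATE BITMAP INDEX ix_event_fact_caller ON event_fact (caller_id) LOCAL;
ALTER INDEX ix_event_fact_caller MODIFY PARTITION p_lead UNUSABLE;

-- Nightly roll-over: close off the lead partition, start a new one, and
-- consolidate/build indexes on the partition that was just closed.
ALTER TABLE event_fact SPLIT PARTITION p_lead
  AT (DATE '2011-10-01')
  INTO (PARTITION p_2011_q3, PARTITION p_lead);
```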
Edit 2: The common fields might be candidates to set up as dimensions on a fact table (e.g. caller, time). The less common fields are (presumably) coding. For an efficient schema you could move the optional coding into one or more 'junk' dimensions.
Briefly, a junk dimension is one that represents every existing combination of two or more codes. A row on the table doesn't relate to a single system entity but to a unique combination of coding. Each row on the dimension table corresponds to a distinct combination that occurs in the raw data.
In order to have any analytic value you are still going to have to organise the data so that the columns in the junk dimension contain something consistently meaningful. This goes back to some requirements work to make sure that the mappings from the source data make sense. You can deal with items that are not always recorded by using a placeholder value such as a zero-length string (''), which is probably better than nulls.
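A sketch of what such a junk dimension could look like (the code columns are invented for illustration; each row is one distinct combination observed in the source data, and the fact row stores only the surrogate key):

```sql
CREATE TABLE junk_dim (
    junk_dim_id   INT PRIMARY KEY,
    campaign_code VARCHAR(20) NOT NULL DEFAULT '',  -- '' as the placeholder value
    channel_code  VARCHAR(20) NOT NULL DEFAULT '',
    outcome_code  VARCHAR(20) NOT NULL DEFAULT ''
);
-- Note: on Oracle a zero-length string is NULL, so use a non-empty
-- placeholder (e.g. 'N/A') there instead.

-- The fact table references the combination rather than carrying the
-- individual optional codes itself:
--   fact.junk_dim_id -> junk_dim.junk_dim_id
-- Loading: look up (or insert) the junk_dim row matching the event's
-- combination of codes and store its key on the fact row.
```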
Now I think I see the underlying requirements. This is an online or phone survey application with custom surveys. The way to deal with this requirement is to fob the analytics off onto the client. No online tool will let you turn around schema changes in 20 minutes.
I've seen this type of requirement before and it boils down to the client wanting to do some stats on a particular survey. If you can give them a CSV based on the fields (i.e. with named header columns) in their particular survey they can import it into excel and pivot it from there.
This should be fairly easy to implement from a configurable online survey system as you should be able to read the survey configuration. The client will be happy that they can play with their numbers in Excel as they don't have to get their head around a third party tool. Any competent salescritter should be able to spin this to the client as a good thing. You can use a spiel along the lines of 'And you can use familiar tools like Excel to analyse your numbers'. (or SAS if they're that way inclined)
Wrap the exporter in a web page so they can download it themselves and get up-to-date data.
Note that the wheels will come off if you have data volumes over 65,535 respondents per survey, as this won't fit onto a spreadsheet tab. Excel 2007 increases this limit to 1,048,575. However, surveys with this volume of response will probably be in the minority. One possible workaround is to provide a means to get random samples of the data that are small enough to work with in Excel.
Edit: I don't think there are other solutions that are sufficiently flexible for this type of application. You've described a holy grail of survey statistics.
I still think that the basic strategy is to give them a data dump. You can pre-package it to some extent by using OLE automation to construct a pivot table and deliver something partially digested. The API for pivot tables in Excel is a bit hairy but this is certainly quite feasible. I have written VBA code that programmatically creates pivot tables in the past, so I can say from personal experience that this is feasible to do.
The problem becomes a bit more complex if you want to compute and report distributions of (say) response times, as you have to construct the displays. You can programmatically construct pivot charts if necessary, but automating report construction through Excel in this way will be a fair bit of work.
You might get some mileage from R (www.r-project.org) as you can construct a framework that lets you import data and generate bespoke reports with a bit of R Code. This is not an end-user tool but your client base sounds like they want canned reports anyway.
