Two processes saving a record in database at the same time - ruby

What happens when more than one user inserts data into a database (MySQL, Postgres) at exactly the same time? How does the database decide which record gets inserted first and which one later? If the answer depends on the type of application, I am asking in reference to web applications.

In general, two things never happen at exactly the same time. There's a queue of work and at some level one thing always happens before the other.
However, there are cases where an overall transaction may take multiple steps -- and if two of these kinds of transactions begin at nearly the same time, they may overlap in time. This can cause problems.
For example, imagine a person buys something in a shopping cart and the steps include both creating an order record for them and decrementing an inventory count. If two people begin this process at nearly the same time, they could both potentially buy the item before the inventory is decremented to show the item as out of stock.
In cases where things like this can occur, Postgres (and other modern databases) provides mechanisms that programs can use to protect themselves. These include both transactions and locking.
With transactions (see postgres docs here), groups of statements are run as a single unit -- and if one of the later steps fails, all steps are 'rolled back'. (For example, if decrementing inventory isn't possible because the item is now out of stock, the order creation can be rolled back.)
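As a rough sketch of how that flow might look in SQL (the table and column names here are just assumptions for illustration, not from any particular schema):

    BEGIN;
    -- Step 1: create the order.
    INSERT INTO orders (customer_id, item_id, quantity) VALUES (1, 100, 1);
    -- Step 2: decrement inventory.
    UPDATE inventory SET stock = stock - 1 WHERE item_id = 100;
    -- If step 2 fails (e.g. a CHECK constraint keeps stock >= 0), issue
    -- ROLLBACK and the order created in step 1 disappears as well.
    -- If everything succeeds, make both changes permanent:
    COMMIT;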
With locking (see postgres docs here), tables (or even individual rows in a table) are locked so that any other process wanting to access them either waits or is timed out. This would prevent two processes from updating the same data at nearly the same time.
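A sketch of row-level locking for the same scenario (again the names are assumptions; SELECT ... FOR UPDATE is supported by Postgres and MySQL/InnoDB):

    BEGIN;
    -- Lock the inventory row so no other transaction can change it until
    -- we commit; a second buyer's transaction will wait (or time out) here.
    SELECT stock FROM inventory WHERE item_id = 100 FOR UPDATE;
    -- The application checks the returned stock; if it is greater than zero:
    UPDATE inventory SET stock = stock - 1 WHERE item_id = 100;
    INSERT INTO orders (customer_id, item_id, quantity) VALUES (2, 100, 1);
    COMMIT;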
In general, the vast majority of applications don't require either of these approaches. Unless you're working in an environment such as at a bank where the tables involved contain financial transactions, you probably won't have to worry about it.

It's never exactly the same time. One will happen before the other.
Which one happens first is indeterminate unless you implement your own prioritisation mechanism, and you should never rely on it.
As to what will happen, well that depends.
For two inserts to the same table: if data integrity depends on the order in which they are executed, your database design has a horrendous flaw.
For collisions (two updates to the same record, for instance), there are two common approaches:
Pessimistic locking. Assume there will be a significant number of updates to the same data, so take a lock around it. If the lock already exists, fail the update (e.g. the second one, if the first hasn't finished) with some suitable message.
Optimistic locking. Assume collisions will rarely happen. The usual way of doing this is to add a timestamp (or version) field to the record which changes on every update. When you read the data you also get the timestamp, and when you write the data you only do so if the timestamp you hold matches the one that's there now, updating said timestamp as part of the write. If it does not match, you show the "someone else has changed this data" message.
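A minimal sketch of the optimistic check in SQL, assuming a version column (the timestamp variant works the same way; names are illustrative):

    -- The earlier read returned version = 7 along with the data.
    UPDATE accounts
    SET    name = 'New name',
           version = version + 1
    WHERE  id = 42
      AND  version = 7;
    -- If this updates 0 rows, someone else changed the record first:
    -- show the "someone else has changed this data" message.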
There is a compromise position where you try to merge the two updates (for instance, you change the name and I change the address). You need to really think about that, though: it's messy, it gets very complicated very quickly, and getting it wrong runs a real risk of messing up the data.
People with far larger IQs than mine spend a lot of time on this stuff, personally I like to keep it like me, simple...

Related

How to approach event sourcing with millions of records

We're looking at implementing event sourcing / CQRS and for 95% of our system I can reason about the events and it doesn't scare me.
On the other hand, we also have a requirement where customers can insert data for millions of records in one go. A large portion of them can be updated in one go as they move location etc., or have batch-level details updated. It also needs to be reversible if they change their mind moments later.
Each record relates to a physical entity in the real world and it's important that the read model is updated quickly and the audit trail preserved at all costs for each record.
I can't seem to find any advice on how to handle these volumes. Are you supposed to write an event for every single record and action and just accept that it's going to be computationally / Database expensive? Are there any case studies that have similar requirements?
Any guidance is appreciated.
Are you supposed to write an event for every single record and action and just accept that it's going to be computationally / Database expensive?
A potentially helpful heuristic -- how would you do it with a version control system? Would you create an empty document, and then introduce a million commits, or would you have a single Data Imported commit, and go from there?
An important consideration to notice is that the authority for the data is somewhere else. "Physical entities in the real world" are not subject to the rules of your domain model; what you have there is a big pile of reference data.
It can help to think in terms of processes -- what you have is an "import reference data" process that has a relatively small number of steps, and, independently, some "do interesting things with each record" process, which may turn out to be millions of little processes, each with a small number of events.
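To make the idea concrete, here is a sketch of what that could look like in a simple SQL event store (PostgreSQL-flavoured; the table, columns and event name are purely illustrative, not a prescription):

    CREATE TABLE events (
        event_id    BIGSERIAL   PRIMARY KEY,
        stream_id   UUID        NOT NULL,
        event_type  TEXT        NOT NULL,
        payload     JSONB       NOT NULL,
        recorded_at TIMESTAMPTZ NOT NULL DEFAULT now()
    );

    -- One event for the whole import (pointing at the staged rows),
    -- rather than one event per imported record.
    INSERT INTO events (stream_id, event_type, payload)
    VALUES (gen_random_uuid(),
            'ReferenceDataImported',
            '{"source": "customer_upload", "row_count": 1000000}');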

What is more important: speed (time) or space?

I had this question when I was given a task which made me think for a while, but I was not able to come up with an accurate or satisfying answer.
The task was something like this:
Things I already have:
a table "User" which contains details about the user, like createdTime, type (agent, admin, other), id, etc. (this table contains a very large number of rows)
The task given to me is:
create a new table which will keep track of users which are deleted.
Then join the "User" table with this newly created table and show the users which are not already deleted and are of type='agent'.
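To make that concrete, the query would look something like this (the deleted_users table and its columns are just placeholder names I am assuming, not the actual schema):

    -- Users that have no row in the deletions table and are agents.
    SELECT u.*
    FROM "User" u
    LEFT JOIN deleted_users d ON d.user_id = u.id
    WHERE d.user_id IS NULL
      AND u.type = 'agent';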
Now my question is
"why are they asking me to create a new table instead of creating a new column(in User table) which will store a flag(true If user is deleted, else not deleted)
"Is join not time consuming?"
(creating a new column in the current "User" table will help in keeping the detail intact with the User.But creating a new table - where this will help?)
When I asked my team member this question, he replied, "If you create a new column, the value can be empty for a lot of rows, and that wastes space" (and what he said is right too).
This suggests that he cares more about space and less about query time.
Nowadays we can get any amount of space with money, but speed is more important, right? If things are slow, what is the use of saving disk space?
Shouldn't we care more about time and less about space?
I would like to know what you think about this. What is more important to you, and why?
I know this question may be down-voted, but I wanted to know what most developers think in such cases. What do they care about more: space or time?
Thanks for your time.
This will be an answer, even though it's just my personal take, because it's too big for a comment:
The two are rarely separate from each other but a function of each other.
Speed is also not just 'query time', but total execution time including data handling, from request to reply.
So if you have something that hogs space, you have more I/O and more memory usage and need to spend more time accessing data and more time transmitting data.
So my "answer" would be:
Treat both as equal and minimize both.
"This suggests that he cares more about space and less about query time":
No. That's a wrong assumption.
If you start thinking about normalization and de-normalization in the database when thinking about performance, you are almost always better off normalizing the data.
This is not only to save space, but also to reduce the cost of maintaining data and data integrity (faster updates, less locking), to keep indexes smaller (space, yes, but also speed), and to reduce the cost of using the data: I/O transfer from disk to server, memory usage, and traffic across networks. Also, space usage translates into memory usage: the more space things take up, the more you need to fit into memory.
All of these things lead to performance, a.k.a. speed.
The times you start thinking of de-normalization in the database, you usually do it in connection with pre-calculating results and queries and using caching, so you don't have to do the joins on demand. So while there are some situations where de-normalization is a plausible solution, more often than not you're still better off normalizing the data.

Calculating results in a scalable way based on transaction data in web app?

Possible duplicate:
Database design: Calculating the Account Balance
I work with a web app which stores transaction data (e.g. "amount x on date y", but more complicated) and provides calculation results based on details of all relevant transactions[1]. We are investing a lot of time into ensuring that these calculations perform efficiently, as they are an interactive part of the application: i.e. a user clicks a button and waits to see the result. We are confident that, for the current levels of data, we can optimise the database fetching and calculation to complete in an acceptable amount of time. However, I am concerned that the time taken will still grow linearly as the number of transactions grows[2]. I'd like to be able to say that we could handle an order of magnitude more transactions without excessive performance degradation.
I am looking for effective techniques, technologies, patterns or algorithms which can improve the scalability of calculations based on transaction data.
There are however, real and significant constraints for any suggestion:
We currently have to support two highly incompatible database implementations, MySQL and Oracle. Thus, for example, using database-specific stored procedures has roughly twice the maintenance cost.
The actual transactions are more complex than the example transaction given, and the business logic involved in the calculation is complicated and regularly changing. Thus having the calculations stored directly in SQL is not something we can easily maintain.
Any of the transactions previously saved can be modified at any time (e.g. the date of a transaction can be moved a year forward or back) and calculations are expected to be updated instantly. This has a knock-on effect for caching strategies.
Users can query across a large space, in several dimensions. To explain, consider being able to calculate a result as it would stand at any given date, for any particular transaction type, where transactions are filtered by several arbitrary conditions. This makes it difficult to pre-calculate the results a user would want to see.
One instance of our application is hosted on a client's corporate network, on their hardware. Thus we can't easily throw money at the problem in terms of CPUs and memory (even if those are actually the bottleneck).
I realise this is very open ended and general, however...
Are there any suggestions for achieving a scalable solution?
[1] Where 'relevant' can be: the date queried for; the type of transaction; the type of user; formula selection; etc.
[2] Admittedly, this is an improvement over the previous performance, where an ORM's n+1 problems saw the time taken grow either exponentially, or at least with a much steeper gradient.
I have worked against similar requirements, and have some suggestions. A lot of this depends on what is possible with your data. It is difficult to make every case imaginable quick, but you can optimize for the common cases and have enough hardware grunt available for the others.
Summarise
We create summaries on a daily, weekly and monthly basis. For us, most of the transactions happen in the current day, but old transactions can also change. We keep a batch record and, under it, the individual transaction records. Each batch has a status to indicate whether the transaction summary (in the batch_summary table) can be used. If an old transaction in a summarised batch changes, the batch is flagged, as part of that same transaction, to indicate that the summary is not to be trusted. A background job will re-calculate the summary later.
Our software then uses the summary when possible and falls back to the individual transactions where there is no summary.
We played around with Oracle's materialized views, but ended up rolling our own summary process.
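A rough sketch of the pattern, with made-up table and column names (not our actual schema, and simplified down to one summary level):

    -- Each batch records whether its pre-computed summary is still valid.
    CREATE TABLE batch (
        batch_id      BIGINT  PRIMARY KEY,
        summary_valid BOOLEAN NOT NULL DEFAULT FALSE
    );

    CREATE TABLE batch_summary (
        batch_id     BIGINT  PRIMARY KEY REFERENCES batch (batch_id),
        txn_count    BIGINT  NOT NULL,
        total_amount NUMERIC NOT NULL
    );

    -- When an old transaction changes, flag its batch in the same
    -- database transaction so readers fall back to the raw rows...
    UPDATE batch SET summary_valid = FALSE WHERE batch_id = 123;

    -- ...and let a background job rebuild the summary later:
    UPDATE batch_summary
    SET    txn_count    = (SELECT COUNT(*)    FROM transactions WHERE batch_id = 123),
           total_amount = (SELECT SUM(amount) FROM transactions WHERE batch_id = 123)
    WHERE  batch_id = 123;
    UPDATE batch SET summary_valid = TRUE WHERE batch_id = 123;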
Limit the Requirements
Your requirements sound very wide. There can be a temptation to put all the query fields on a web page and let the users choose any combination of fields and output results. This makes it very difficult to optimize. I would suggest digging deeper into what they actually need to do, or have done in the past. It may not make sense to query on very unselective dimensions.
In our application, for certain queries, we limit the date range to not more than one month. We have also aligned some features to the date-based summaries, e.g. you can get results for the whole of Jan 2011, but not for 5-20 Jan 2011.
Provide User Interface Feedback for Slow Operations
On occasion we have found it difficult to optimize some things to take less than a few minutes. In those cases we ship the job off to a background server rather than have a very slow-loading web page. The user can fire off a request and go about their business while we work out the answer.
I would suggest using Materialized Views. Materialized Views allow you to store a View as you would a table. Thus all of the complex queries you need to have done are pre-calculated before the user queries them.
The tricky part is of course updating the Materialized View when the tables it is based on change. There's a nice article about it here: Update materialized view when underlying tables change.
Materialized Views are not (yet) available without plugins in MySQL and are horribly complicated to implement otherwise. However, since you have Oracle I would suggest checking out the link above for how to add a Materialized View in Oracle.
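On the Oracle side, a minimal example might look like this (an aggregate-only sketch with assumed table and column names; whether it can actually be fast-refreshed on commit depends on the materialized view logs and the exact query shape):

    -- A log on the base table enables incremental ("fast") refresh.
    CREATE MATERIALIZED VIEW LOG ON transactions
      WITH ROWID (txn_date, amount) INCLUDING NEW VALUES;

    CREATE MATERIALIZED VIEW daily_totals
      BUILD IMMEDIATE
      REFRESH FAST ON COMMIT
    AS
      SELECT txn_date,
             COUNT(*)      AS txn_count,
             COUNT(amount) AS amount_count,  -- needed for fast refresh of SUM
             SUM(amount)   AS total_amount
      FROM   transactions
      GROUP  BY txn_date;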

what is a good schema design for a loan origination system?

I am designing a loan origination system which would allow its users to create loans and draw up a repayment schedule for each loan depending on the loan product's parameters. I should also be able to add penalties, fees, etc. Rescheduling a loan should be a possibility. I also need the loan schedule to support fast reporting.
I have a loans table, a loan product table, a payment schedule table, a loan history table, etc. I am not able to work out how to design this schema ahead of time to keep it from changing too much.
I am doing this in ruby, rails3 and datamapper.
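For reference, a stripped-down sketch of what I have in mind (the columns are purely illustrative, not the full design):

    CREATE TABLE loan_products (
        id            BIGSERIAL PRIMARY KEY,
        name          TEXT      NOT NULL,
        interest_rate NUMERIC   NOT NULL,
        term_months   INTEGER   NOT NULL
    );

    CREATE TABLE loans (
        id              BIGSERIAL PRIMARY KEY,
        loan_product_id BIGINT    NOT NULL REFERENCES loan_products (id),
        principal       NUMERIC   NOT NULL,
        disbursed_on    DATE      NOT NULL
    );

    CREATE TABLE payment_schedules (
        id        BIGSERIAL PRIMARY KEY,
        loan_id   BIGINT    NOT NULL REFERENCES loans (id),
        due_on    DATE      NOT NULL,
        principal NUMERIC   NOT NULL,
        interest  NUMERIC   NOT NULL,
        fees      NUMERIC   NOT NULL DEFAULT 0
    );

    CREATE TABLE loan_history (
        id          BIGSERIAL PRIMARY KEY,
        loan_id     BIGINT    NOT NULL REFERENCES loans (id),
        event_type  TEXT      NOT NULL,   -- e.g. 'rescheduled', 'penalty_added'
        occurred_at TIMESTAMP NOT NULL DEFAULT now()
    );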
Except in the most tightly specified applications, I'm not sure you can design a schema that won't change much. What you can do is to make schemas that are not brittle, schemas that allow for change to happen. For the most part, that means:
Include only the data you know you need to meet today's requirements
Normalize.
Write automated tests.
The first rule is akin to "do the simplest thing possible," or "you ain't gonna need it," the rule that programmers use to avoid code bloat. Smaller schemas, like smaller code bases, need less effort to change. The second (normalize) is analogous to the Don't Repeat Yourself (DRY) principle, also known as "once and only once," another rule used to make code cheaper to change. The third rule (tests) is how programmers make refactoring possible without worrying about breaking everything. By tests, I mean testing the code that uses the schema, but also testing the schema itself: triggers, rules, cascading deletes, &c. can be tested, and when tested, it is easier to change them in response to changing requirements.
There are excuses, in the database world, for breaking these rules. The reason to break rule 1 (do the simplest thing/YAGNI) is that some data will be easier to collect from the beginning, and difficult or perhaps even impossible to collect if you decide you do need it later. Still, think twice before giving in to this excuse. You can almost always deal without too much fuss with gaps in the data caused by adding columns or tables later, but if you include data today that you might only need tomorrow, you will be paying for it every time you change the schema. Each bit of data you include that you end up not needing was nothing but cost with no benefit. Perhaps more significantly, extra data can have a terrible effect upon performance, since it reduces the number of records that can fit in memory. Even though databases go to great pains to give good performance when reading from disk, their best performance comes from having enough memory (or little enough data) so that all or most of the working set will fit in RAM.
The excuse for breaking rule 2 (normalization) is performance: "Data warehouse" applications sometimes require denormalization in cases where many-table joins make a database slow and cranky. I'd want to be certain it was needed before denormalizing, since it's not free: data that exists in more than one place makes the schema more difficult to change, and trades off speed of queries for more work when inserting & updating.
I don't know of an excuse for breaking rule 3 (testing), or at least not a good one, but that doesn't mean there isn't one.
Martin Fowler writes about "Evolutionary Database Design". Scott Ambler and Pramod Sadalage have a book on Refactoring Databases. See also a summary/cheat sheet of the book's refactorings.

Asking for opinions : One sequence for all tables

Here's another one I've been thinking about lately.
We have concluded in earlier discussions: 'natural primary keys are bad, artificial primary keys are good.'
Working with Hibernate earlier, I saw that Hibernate by default creates one sequence for all tables. At first I was puzzled by this: why would you do that? But later I saw the advantage: it makes linking parents and children foolproof. Because no two tables share the same primary key values, accidentally joining a parent to a table that is not its child gives no results.
Does anyone see any downsides to this approach? I only see one: you cannot have more than 999999999999999999999999999 records in your database.
There could be performance issues with all code getting values from a single sequence - see this Ask Tom thread.
Depending on how sequences are implemented in the database, always hitting the same sequence can be better or worse. When only one or a few threads request new values, there will be no locking issues, but a bad implementation could cause congestion.
Another problem is rolling back transactions: Sequences don't get rolled back (because someone else might have requested a higher value already), so you can have large gaps which will eat your number space much more quickly than you might expect. OTOH, it will take some time to eat 2 or 4 billion IDs (if you "only" use 32 bit (signed) ints), so it's rarely an issue in practice.
Lastly, you can't easily reset the sequence if you have to. But if you need to have a restarting sequence (say, number of records since midnight), you can tell Hibernate to create/use a second sequence.
A major advantage is that you can uniquely identify objects anywhere in the DB just by the ID. That means you can severely cut down the log information you write in the production system and still find something if you only have the ID.
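A sketch of the single-sequence idea in plain SQL (PostgreSQL syntax; the table names are made up, and Hibernate's default hibernate_sequence works the same way conceptually):

    CREATE SEQUENCE hibernate_sequence;

    CREATE TABLE customers (
        id   BIGINT PRIMARY KEY DEFAULT nextval('hibernate_sequence'),
        name TEXT   NOT NULL
    );

    CREATE TABLE orders (
        id          BIGINT PRIMARY KEY DEFAULT nextval('hibernate_sequence'),
        customer_id BIGINT NOT NULL REFERENCES customers (id)
    );
    -- Any given id value now exists in at most one table, so a log line
    -- containing only "id=123456" still identifies a unique row.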
I prefer having one sequence per table. This comes from one general observation: Some tables ("master tables") have a relatively small row count and have to be kept "forever". For example, the customer table in an ERP.
In other tables ("transaction tables"), many rows are generated perpetually, but after some time, those rows can be archived (or simply deleted). The most extreme example is a tracing table used for debugging purposes; it might grow by hundreds of rows per second, but each row is obsolete after a few days.
Small IDs in the master tables make it easier when working directly on the database, e.g. for debugging purposes.
select * from orders where customerid=415
vs
select * from orders where customerid=89461836571
But this is only a minor issue. The bigger issue is cycling. If you use one sequence for all tables, you simply cannot let it restart. With one sequence per table, you can restart the sequences for the transaction tables when you have archived or deleted the old data. Master tables hardly ever have that problem, since they grow much slower.
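For example (PostgreSQL syntax; in Oracle, older versions need a drop-and-recreate, while newer ones support ALTER SEQUENCE ... RESTART; the sequence name is illustrative):

    -- After archiving or deleting old rows from a transaction table, its
    -- dedicated sequence can be restarted without touching any other table.
    ALTER SEQUENCE trace_log_id_seq RESTART WITH 1;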
I see little value in having only one sequence for all tables. The arguments given so far do not convince me.
There are a couple of disadvantages to using a single sequence:
Reduced concurrency. Handing out the next sequence value involves synchronisation. In practice, I do not think this is likely to be a big problem.
Oracle has special code when maintaining B-tree indexes to detect monotonically increasing values and balance the tree appropriately.
The CBO might have a better time estimating range queries on the index (if you ever did this) if most values were filled in.
An advantage might be that you can determine the order of inserts amongst different tables.
Certainly there are pros and cons to the one-sequence versus one-sequence-per-table approach. Personally I find the ability to assign a truly unique identifier to a row, making each id column a uuid, to be enough of a benefit to outweigh any disadvantages. As Aaron D. succinctly writes:
you can uniquely identify objects anywhere in the DB just by the ID
And, for most applications, due to the way Hibernate3 batches INSERT statements, this will not be a performance bottleneck unless massive amounts of records are vying for the same db resource (SELECT hibernate_sequence.nextval FROM dual).
Also, this sequence mapping is not supported in the latest release (1.2) of Grails. Though it was supported in Grails 1.1 (!). It now requires subclassing one of the Hibernate dialect classes as a workaround.
For those using Grails/GORM, have a look at this JIRA entry:
Oracle Sequence mappings ignored
