Cost of time-stamping as a method of concurrency control with Entity Framework - performance

With optimistic concurrency, the usual way to detect conflicts is a timestamp field. However, in my particular case, not all of the fields need to be controlled with respect to concurrency.
For example, I have a products table that holds the amount of stock. This table also has fields like description, code, etc. It is not a problem for me if one user modifies those fields, but I do have to detect when another user changes the stock.
So if I use a timestamp and one user changes the description while another changes the amount of stock, the second user will get a concurrency exception.
However, if I use the stock field itself as the concurrency token instead of the timestamp, then the first user can update the descriptive information and the second can update the stock without problems.
Is it a good solution to use the stock field to control concurrency, or is it better to always use a timestamp field?
And if in the future I need to add another important field, will I then need two fields to control concurrency, the stock and the new one? Does that have a high cost in terms of performance?

Consider the definition of optimistic concurrency:
In the field of relational database management systems, optimistic concurrency control (OCC) is a concurrency control method that assumes that multiple transactions can complete without affecting each other, and that therefore transactions can proceed without locking the data resources that they affect. (Wikipedia)
Clearly this definition is abstract and leaves a lot of room for your specific implementation.
Let me give you an example. A few years back I evaluated the same thing with a bunch of colleagues and we realized that in our application, on some of the tables, it was okay for the concurrency to simply be based on the fields the user was updating.
So, in other words, as long as the fields they were updating hadn't changed since they fetched the row, we'd let them update it, because the rest of the fields really didn't matter and the row was going to get refreshed on update anyway, so they would see the most recent changes made by other users.
So, in short, I would say what you're doing is just fine and there aren't really any hard and fast rules. It really depends on what you need. If you need it to be more flexible, like what you're talking about, then make it more flexible -- simple.
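If it helps to see the mechanics outside of any particular ORM, here is a minimal JDBC-style sketch of using the stock column itself as the concurrency token (table and column names are assumptions for illustration); a per-property concurrency check in an ORM boils down to essentially the same kind of conditional UPDATE:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class StockUpdater {

    // Succeeds only if the stock value is still what this user originally read;
    // edits to description/code by other users do not cause a conflict.
    public void updateStock(Connection connection, long productId,
                            int originalStock, int newStock) throws SQLException {
        String sql = "UPDATE products SET stock = ? WHERE id = ? AND stock = ?";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setInt(1, newStock);
            ps.setLong(2, productId);
            ps.setInt(3, originalStock);
            if (ps.executeUpdate() == 0) {
                // Zero rows updated means another user changed the stock in the meantime.
                throw new IllegalStateException("Concurrency conflict: stock was modified by another user");
            }
        }
    }
}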

Related

How to update ReadModel of an Aggregate that has an association with another Aggregate

I'm trying to separate read and write models. In summary, I have these two entities with an association between them:
//AggregateRoot
class ProfessionalFamily {
    private ProfessionalFamilyId id;
    private String name;
}

//AggregateRoot
class Group {
    private GroupId id;
    private String literal;
    private ProfessionalFamilyId professionalFamilyId; // ManyToOne association referenced by the ID of "professional-family"
}
The read model I'm using to return data in a grid is the following one.
class GroupReadModel {
    private String id;
    private String groupLiteral;
    private String professionalFamilyName;
}
I want to use NoSql for the read-model queries and keep them separate from the write models. But my headache is this: with that approach, when a Group is created I fire an event (GroupCreated), and an event handler listens for the event and stores the read/view/projection model in the NoSql database. So my question is: if I need to update the ProfessionalFamily name, and it is related to more than, for example, 1000 groups (there can be many more), how can I update all the groups in the read model that are related to the ProfessionalFamily I've just updated? Most probably I'm not doing anything well.
Thanks a lot.
NoSql databases are usually not designed to support data normalization and even intentionally break with this concept. If you were using a relational database system you would usually normalize your data, and for each group you would only store the id of the ProfessionalFamily rather than duplicating its name in each group document. So in general, for NoSql databases, duplication is accepted.
But I think before deciding to go with NoSql or a relational database you should consider (at least) the following:
Priority for speed of reads vs. writes:
If you need your writes (in your case, changes of the name) to be very fast because they happen very often, and read speed is of lower priority, maybe NoSql is not the best choice. You could still look into technology such as MongoDB, which provides some kind of hybrid approach and allows you to normalize and index data to a certain extent.
Writes will usually be faster with a normalized structure in a relational database, whereas reads will normally be faster with denormalization and duplication in a NoSql database. But this of course depends on the technologies you are comparing, on the number of entities (in your case, Groups) we are talking about, and on the amount of cross-referenced data. If you need to do lots of joins during reads due to normalization, your read performance will usually be worse than with Group documents where all required data is already there due to duplication.
Control over the data structure/schema
If you are the one who knows what the data will look like, you might not need the advantage of a NoSql database, which is very well suited for data structures that change frequently or that you are not in control of. If that is not really your case, you might not benefit enough from NoSql technology.
In addition, there is another thing to consider: how consistent does your read-model data have to be? As you are using some kind of event-sourcing approach, I guess you are already embracing eventual consistency. That means not only that the event processing is performed asynchronously, but also that you could accept that - getting back to your example - not all groups are updated with the new family name at the same moment; they can be updated asynchronously or via some background jobs, as long as it is not a problem that one group still shows the old name while another group already shows the new name for some time.
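To make that concrete: if the read store were MongoDB, the projection would not need to load and loop over the 1000+ group documents in application code; a single bulk update on the denormalized field is enough. A rough sketch, assuming a group read-model collection and an event that carries the family id and the new name (database, collection, and field names are illustrative, not taken from your model):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class GroupReadModelProjector {

    private final MongoCollection<Document> groups;

    public GroupReadModelProjector(MongoClient client) {
        // Assumed database/collection names for the read side.
        this.groups = client.getDatabase("readmodel").getCollection("groupReadModel");
    }

    // Handler for a hypothetical ProfessionalFamilyRenamed event.
    public void onProfessionalFamilyRenamed(String professionalFamilyId, String newName) {
        // One server-side command updates every group document that references the family.
        groups.updateMany(
                Filters.eq("professionalFamilyId", professionalFamilyId),
                Updates.set("professionalFamilyName", newName));
    }
}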
Most probably I'm not doing anything well.
You are not doing anything wrong or right per se by choosing this approach, as long as you decide for (or against) NoSql for the right reasons, which include these considerations.
My team and I discussed a similar scenario recently and we solved it by changing our CRUD approach to a DDD approach. Here is an example:
Given a traveler with a list of visited destinations.
If I have an event such as destinationUpdated, then I should loop across every traveler like you said, but does that make sense? What does destinationUpdated mean from a user's point of view? Nothing! You should find the real user intent.
If the traveler made a mistake entering his visited destination, then your event should be travelerCorrectedDestination, which solves the problem because travelerCorrectedDestination now contains the traveler ID, so you don't have to loop through all travelers anymore.
By applying a DDD approach, problems like this usually solve themselves.
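For illustration, such an intent-revealing event might look roughly like this (a sketch; the field names are assumptions):

import java.time.Instant;
import java.util.UUID;

// Carries the traveler's id, so the projection only touches that traveler's
// read model instead of looping over every traveler.
public final class TravelerCorrectedDestination {

    private final UUID travelerId;
    private final UUID destinationId;
    private final String correctedDestinationName;
    private final Instant occurredAt = Instant.now();

    public TravelerCorrectedDestination(UUID travelerId, UUID destinationId,
                                        String correctedDestinationName) {
        this.travelerId = travelerId;
        this.destinationId = destinationId;
        this.correctedDestinationName = correctedDestinationName;
    }

    public UUID getTravelerId() { return travelerId; }
    public UUID getDestinationId() { return destinationId; }
    public String getCorrectedDestinationName() { return correctedDestinationName; }
    public Instant getOccurredAt() { return occurredAt; }
}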

Which spring transaction isolation level to use to maintain a counter for product sold?

I have an e-commerce site written with Spring Boot + Angular. I need to maintain a counter in my product table to track how many items have been sold. But the counter sometimes becomes inaccurate when many users order the same item concurrently.
In my service code, I have the following transactional declaration:
@Transactional(propagation = Propagation.REQUIRES_NEW, isolation = Isolation.READ_COMMITTED)
in which, after persisting the order (using CrudRepository.save()), I run a select query to sum the quantities ordered so far, hoping the select will count all orders that have been committed. But that doesn't seem to be the case: from time to time, the counter is less than the actual number.
The same issue happens in my other use case: limiting the quantity of a product. I use the same transaction isolation setting. In the code, I do a select query to see how many have been sold and throw an out-of-stock error if we can't fulfill the order. But for hot items, we sometimes oversell because each thread doesn't see the orders just committed by other threads.
So is READ_COMMITTED the right isolation level for my use case? Or should I use pessimistic locking instead?
UPDATE 05/13/17
I chose Ruben's approach, as I know more about Java than databases, so I took the road that was easier for me. Here's what I did.
@Transactional(propagation = Propagation.REQUIRES_NEW, isolation = Isolation.SERIALIZABLE)
public void updateOrderCounters(Purchase purchase, ACTION action)
I use JpaRepository, so I don't work with the EntityManager directly. Instead, I just put the code that updates the counters in a separate method annotated as above. It seems to work well so far. I have seen >60 concurrent connections placing orders with no overselling, and the response time seems OK as well.
Depending on how you retrieve the total sold-items count, the available options might differ:
1. If you calculate the sold items count dynamically via a sum query on orders
I believe in this case the option you have is using SERIALIZABLE isolation level for the transaction, since this is the only one which supports range locks and prevents phantom reads.
However, I would not really recommend this isolation level, since it has a major performance impact on your system (or it should be used really carefully, in well-designed spots only).
Links : https://dev.mysql.com/doc/refman/5.7/en/innodb-transaction-isolation-levels.html#isolevel_serializable
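A sketch of what option 1 might look like in a Spring service, assuming an OrderLine entity with a quantity field and a reference to the product (entity and field names are made up for illustration):

import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Isolation;
import org.springframework.transaction.annotation.Transactional;

@Service
public class SoldItemsCounter {

    @PersistenceContext
    private EntityManager entityManager;

    // SERIALIZABLE prevents phantom reads: two concurrent transactions cannot
    // both miss each other's newly inserted order rows when summing.
    @Transactional(isolation = Isolation.SERIALIZABLE)
    public long countSoldItems(Long productId) {
        return entityManager
                .createQuery("select coalesce(sum(o.quantity), 0L) from OrderLine o "
                        + "where o.product.id = :productId", Long.class)
                .setParameter("productId", productId)
                .getSingleResult();
    }
}

In practice the availability check and the insert of the new order would run inside that same serializable transaction.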
2. If you maintain a counter on product or some other row associated with the product
In this case I would probably recommend using row-level locking, e.g. SELECT ... FOR UPDATE, in a service method which checks the availability of the product and increments the sold-items count. The high-level algorithm for placing an order could be similar to the steps below:
Retrieve the row storing the remaining/sold items count using a SELECT ... FOR UPDATE query (@Lock(LockModeType.PESSIMISTIC_WRITE) on a repository method; a sketch of such a repository appears after the code and links below).
Make sure the retrieved row has up-to-date field values, since it could come from the Hibernate session-level cache (Hibernate would just execute the SELECT ... FOR UPDATE on the id to acquire the lock). You can achieve this by calling entityManager.refresh(entity).
Check the count field of the row, and if the value satisfies your business rules, increment or decrement it.
Save the entity, flush the Hibernate session, and commit the transaction (explicitly or implicitly).
The meta code is below:
@Transactional
public Product performPlacement(@Nonnull final Long id) {
    Assert.notNull(id, "Product id should not be null");
    entityManager.flush();
    final Product product = entityManager.find(Product.class, id, LockModeType.PESSIMISTIC_WRITE);
    // Make sure to get the latest version from the database after acquiring the lock,
    // since if a load was performed earlier in the same Hibernate session, Hibernate will only
    // acquire the lock but keep using the field values from the session cache.
    entityManager.refresh(product);
    // Execute check and booking operations.
    // This method call could just check if availableCount > 0.
    if (product.isAvailableForPurchase()) {
        // This method could potentially just decrement the available count, e.g. --availableCount.
        product.registerPurchase();
    }
    // Persist the updated product.
    entityManager.persist(product);
    entityManager.flush();
    return product;
}
This approach will make sure that no two threads/transactions ever perform a check-and-update on the row storing a product's count concurrently.
However, because of that it will also degrade the performance of your system, so it is essential to make sure that the atomic increment/decrement is used as late in the purchase flow as possible and as rarely as possible (e.g. right in the checkout handling routine, when the customer hits pay). Another useful trick for minimizing the effect of the lock is to put that 'count' column not on the product itself but on a different table associated with the product. This prevents you from locking the product rows, since the locks will be acquired on a different row/table combination used purely during the checkout stage.
Links: https://dev.mysql.com/doc/refman/5.7/en/innodb-locking-reads.html
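As mentioned in step 1, the same lock can be obtained through a Spring Data repository instead of the EntityManager; a sketch of such a repository (the query and method name are illustrative, and the Product entity is the one from the meta code above):

import java.util.Optional;
import javax.persistence.LockModeType;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Lock;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

public interface ProductRepository extends JpaRepository<Product, Long> {

    // Spring Data issues a SELECT ... FOR UPDATE for this method, so the row
    // stays locked until the surrounding transaction commits.
    @Lock(LockModeType.PESSIMISTIC_WRITE)
    @Query("select p from Product p where p.id = :id")
    Optional<Product> findByIdForUpdate(@Param("id") Long id);
}

The refresh step from the list above still applies if the entity may already be present in the persistence context.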
Summary
Please note that both of these techniques introduce extra synchronization points into your system and hence reduce throughput. So please make sure to carefully measure the impact on your system via a performance test or whatever technique your project uses to measure throughput.
Quite often online shops choose to accept overselling/overbooking of some items rather than hurting performance.
Hope this helps.
With these transaction settings, you should see everything that is committed. But still, your transaction handling isn't watertight. The following might happen:
Let's say you have one item in stock left.
Now two transactions start, each ordering one item.
Both check the inventory and see: "Fine, enough stock for me."
Both commit.
Now you oversold.
Isolation level serializable should fix that. BUT
the isolation levels available in different databases vary widely, so I don't think it is actually guaranteed to give you the requested isolation level
this seriously limits scalability. The transactions doing this should be as short and as rare as possible.
Depending on the database you are using, it might be a better idea to implement this with a database constraint. In Oracle, for example, you could create a materialized view calculating the complete stock and put a constraint on the result requiring it to be non-negative.
Update
For the materialized view approach you do the following.
Create a materialized view that calculates the value you want to constrain, e.g. the sum of orders. Make sure the materialized view gets updated within the transactions that change the content of the underlying tables.
For Oracle this is achieved with the ON COMMIT clause.
ON COMMIT Clause
Specify ON COMMIT to indicate that a fast refresh is to occur whenever the database commits a transaction that operates on a master table of the materialized view. This clause may increase the time taken to complete the commit, because the database performs the refresh operation as part of the commit process.
See https://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_6002.htm for more details.
Put a check constraint on that materialized view to encode the rule you want, e.g. that the value is never negative. Note that a materialized view is just another table, so you can create constraints just as you normally would.
See for example https://www.techonthenet.com/oracle/check.php

Database design: Same table structure but different table

My latest project deals with a lot of "staging" data.
For example, when a customer registers, the data is stored in the "customer_temp" table, and when he is verified, the data is moved to the "customer" table.
Before I start shooting off e-mails and go on a rampage about how I think this is wrong and you should just put a flag on the row: there is always a chance that I'm the idiot.
Can anybody explain to me why this is desirable?
Creating 2 tables with the same structure, populating a table (table 1), then moving the whole row to a different table (table 2) when certain events occur.
I can understand it if table 2 stored archival, seldom-used data.
But I can't understand it if table 2 stores live data that changes constantly.
To recap:
Can anyone explain how wrong (or right) this seemingly counter-productive approach is?
If there is a significant difference between a "customer" and a "potential customer" in the business logic, separating them out in the database can make sense (you don't need to always remember to query by the flag, for example). In particular if the data stored for the two may diverge in the future.
It makes reporting somewhat easier and reduces the chances of treating both types of entities as the same one.
As you say, however, this does look redundant and would probably not be the way most people design the database.
There seem to be several explanations for why you would want "customer_temp".
As you noted, it could be for archival purposes, to allow analyzing the data; but in that case the historical data should be aggregated according to some interesting query, and using it for live data does not sound plausible.
As Oded noted, there could be certain business logic that differentiates between a customer and a potential customer.
Or it could be a security feature which requires logging all attempts to register a customer in addition to storing approved customers.
Any time I see a permanent table named "customer_temp" I see a red flag. It typically means that someone was working through a problem as they went along and didn't think ahead.
As for the structure you describe, there are some advantages. For example, the tables could be indexed differently or placed in different file locations for performance.
But typically these advantages aren't worth the cost of keeping the structures in sync through changes (adding a column to both tables, searching for two sets of dependencies, etc.).
If you really need them to be treated differently, then it's better to handle that by adding a layer of abstraction with a view rather than creating two separate models.
I would have used a single table design, as you suggest. But I only know what you posted about the case. Before deciding that the designer was an idiot, I would want to know what other consequences, intended or unintended, may have followed from the two table design.
For example, it may reduce contention between processes that are storing new potential customers and processes accessing the existing customer base. Or it may permit certain columns to be constrained to be not null in the customer table while being permitted to be null in the potential customer table. Or it may permit write access to the customer table to be tightly controlled and unavailable to operations that originate from the web.
Or the original designer may simply not have seen the benefits you and I see in a single table design.

Strategy for updating data in databases (Oracle)

We have a product using Oracle, with about 5000 objects in the database (tables and packages). The product is divided into two parts: the first is the hard part (client, packages, and database schema); the second is composed basically of soft data representing processes (workflows) that can be configured to run on our product.
The basic processes (workflows) are delivered as part of the product, and our customers can change these processes and adapt them to their needs. The problem arises when upgrading to a newer version of the product: when we try to update the data records, there are problems with records that have been deleted or modified by our customers.
Is there a strategy to handle this problem?
It is common for a software product to comprise not just client and schema objects but data as well; this is typically called "static data", i.e. data that should only be modified by the software developer and is usually not modifiable by end users.
If the end users bypass your security controls and modify/delete the static data, then you need to either:
write code that detects, and compensates for, any modifications the end user may have done; e.g. wipe the tables and repopulate with "known good" data;
get samples of modifications from your customers so you can hand-code customised update scripts for them, without affecting their customisations; or
don't allow modifications of static data (i.e. if they customise the product by changing data they shouldn't, you say "sorry, you modified the product, we don't support you").
From your description, however, it looks like your product is designed to allow customers to customise it by changing data in these tables; in which case, your code just needs to be able to adapt to whatever changes they may have made. That needs to be a fundamental consideration in the design of the upgrade. The strategy is to enumerate all the types of changes that users may have made (or are likely to have made), and cater for them. The only viable alternative is #1 above, which removes all customisations.

Thread-safe unique entity instance in Core Data

I have a Message entity that has a messageID property. I'd like to ensure that there's only ever one instance of a Message entity with a given messageID. In SQL, I'd just add a unique constraint to the messageID column, but I don't know how to do this with Core Data. I don't believe it can be done in the data model itself, so how do you go about it?
My initial thought is to use a validation method to do a fetch on the NSManagedObject's context for the ID, see if it finds anything but itself, and if so, fails the validation. I suspect this will work - but I'm worried about the performance of something like that. I went through a lot of effort to minimize the fetch requests needed for the entire import routine, and having it validate by performing a fetch for every single new message entity seems a bit excessive. I can get all pre-existing objects I need and identify all the new objects I need to insert into the store using just two fetch queries before I do the actual work of importing and connecting everything together. This would add a fetch to every single update or insert in addition to those two - which would seem to eliminate any performance advantage I had by pre-processing the import data in the first place!
The main reason this is an issue is that the importer can (potentially) run several batches concurrently on several threads and may include some overlapping/duplicate data that needs to ultimately result in just one object in the store and not duplicate entries. Is there a reasonable way to do this and does what I'm asking for make sense for Core Data?
The only way to guarantee uniqueness is to do a fetch. Fortunately you can just do a -countForFetchRequest:error: and check to see if it is zero or not. That is the least expensive way to guarantee uniqueness at this time.
You can probably accomplish this in the validation, or run it in the loop that is processing the data. Personally, I would do it before creating the NSManagedObject so that you don't do unnecessary allocations when a record already exists.
I don't think there is a way to easily guarantee an attribute is unique without doing a lot of work on your own. You can, of course use CFUUIDCreate to create a globally unique UUID, which should be unique, even in a multithreaded environment. But...
The objectID (type NSManagedObjectID) of all managed objects is guaranteed to be unique within the persistent store coordinator. Since you can add arbitrarily many persistent stores to the coordinator, this guarantee basically guarantees that the objectIDs are globally unique. Why don't you use the objectID as your messageID? You can't, of course, change the objectID once it's assigned (and it won't get assigned until the context containing the inserted object is saved; until then it will be a temporary but still unique ID).
So you have an NSManagedObjectContext for each thread, backed by the same persistent store, is that correct? And before you save the NSManagedObjectContext, you'd like to make sure the messageID is unique, that is, that you are not updating an existing row and that it is not in one of the other contexts, correct?
Given that model (correct me if I misunderstand), I think you'd be better served having one object that manages access to the persistent store. That way, all threads would update one context and you can do your validation in there, using Marcus's -countForFetchRequest:error: suggestion. Granted, that places a bottleneck on this operation.
Just to add my 2 cents: I think inconsistencies will occur sooner or later anyway, and the only way to mitigate them seems to be doing it at the application level with rather complex code.
So in my case I decided to allow duplicate values for what are supposed to be "unique" fields.
I added code, however, that detects these problems later (e.g. when a fetch that should return 1 object returns more than 1) and fixes them when they occur (usually by deleting).
It's a "go ahead, make a mistake, ill fix it later for you"-strategy.
This is not ideal, of course, but it's a valid way to attack this problem, imho.

Resources