Handling passive deletion updates (ie. archiving instead of deleting)

Handling passive deletion updates (ie. archiving instead of deleting) - asp.net-mvc-3

We are developing an application based on DDD principles. We have encountered a couple of problems so far that we can't answer nor can we find the answers on the Internet.
Our application is intended to be a cloud application for multiple companies.
One of the demands is that there are no physical deletions from the database. We make only passive deletion by setting Active property of entities to false. That takes care of Select, Insert and Delete operations, but we don't know how to handle update operations.
Update means changing values of properties, but also means that past values are deleted and there are many reasons that we don't want that. One of the primary reason is for Accounting purposes.
If we make all update statements as "Archive old values" and then "Create new values" we would have a great number of duplicate values. For eg., Company has Branches, and Company is the Aggregate Root for Branches. If I change Companies phone number, that would mean I have to archive old company and all of its branches and create completely new company with branches just for one property. This may be a good idea at first, but over time there will be many values which can clog up the database. Phone is maybe an irrelevant property, but changing the Address (if street name has changed, but company is still in the same physical location) is a far more serious problem.
Currently we are using ASP.NET MVC with EF CF for repository, but one of the demands is that we are able to easily switch, or add, another technology like WPF or WCF. Currently we are using Automapper to map DTO's to Domain entities and vice versa and DTO's are primary source for views, ie. we have no view models. Application is layered according to DDD principle, and mapping occurs in Service Layer.
Another demand is that we musn't create a initial entity in database and then fill the values, but an entire aggregate should be stored as a whole.
Any comments or suggestions are appreciated.
We also welcome any changes in demands (as this is an internal project, and not for a customer) and architecture, but only if it's absolutely neccessary.
Thank you.

Have you ever come across event sourcing? Sounds like it could be of use if you're interested in tracking the complete history of aggregates.

To be honest I would create another table that would be a change log inserting the old record and deleted records etc etc into it before updating the live data. Yes you are creating a lot of records but you are abstracting this data from live records and keeping this data as lean as possible.
Also when it comes to clean up and backup you have your live date and your changed / delete data and you can routinely back up and trim your old changed / delete and reduced its size depending on how long you have agreed to keep changed / delete data live with the supplier or business you are working with.
I think this would be the best way to go as your core functionality will be working on a leaner dataset and I'm assuming your users wont be wanting to check revision and deletions of records all the time? So by separating the data you are accessing it when it is needed instead of all the time because everything is intermingled.

Related

How to handle multiple customers with different SQL databases

Summary
I have a project with multiple existing MSSQL databases, I already created an Azure Analysis Service where I deployed my first Tabular Cube. I already tested to access the Analysis Service, worked perfectly.
Finally I have to duplicate the above described for ~90 databases (90 different customers).
I'm unsure how to organize this project and I'm not sure about the possibilities I have.
What I did
I already browsed the Internet to find some information, but I just found a single source where somebody asked a similar question, the first reply is what I was already thinking about, as I described below.
The last reply I don't really understand, what does he mean with one solution, is there another hierarchy above the project?
Question
A possibility would be to import each database as a source in the same project, but I think this means I have to import each table from this source, means finally 5*90 = 450 tables, I think this gets quickly outta control?
Also I thought about duplicating the whole Visual Studio Project folder for ~90 times for each customer, but at the moment I fail to find all references to change the name, but I think this wouldn't be to hard.
Is there an easier way to achieve my goal? Especially regarding maintainability.
Solution
I will make a completely new Database with all the needed tables. Inside those tables I copy the databases from all customers with a new column customerId. The data I'll transfer with a cyclic job, periodicity to define. Updates in already existing row in the customer database I handle with a trigger.

For this the best approach would be to create a staging database and import the data from the other databases, so your Tabular Model can read the data from it.
Doing 90+ databases is going to be a massive admin overhead and getting the cube to lad them effectively is going to be problematic. Move the data using SSIS/Data factory as you'll be able to better orchestrate the data movement, and incremental loads that way. That way if you need to add/remove/update data sources it is not done in the Cube, its all done at the database/data factory level.

Just use one database for all the customers and differentiate each customer with a customer_id column.

Strategy for updating data in databases (Oracle)

We have a product using Oracle, with about 5000 objects in the database (tables and packages). The product was divided into two parts, the first is the hard part: client, packages and database schema, the second is composed basically by soft data representing processes (Workflow) that can be configured to run on our product.
Well, the basic processes (workflow) are delivered as part of the product, our customers can change these processes and adapt them to their needs, the problem arises when trying to upgrade to a newer version of the product, then trying to update the database records data, there are problems for records deleted or modified by our customers.
Is there a strategy to handle this problem?

It is common for a software product to be comprised of not just client and schema objects, but data as well; typically it seems to be called "static data", i.e. it is data that should only be modified by the software developer, and is usually not modifiable by end users.
If the end users bypass your security controls and modify/delete the static data, then you need to either:
write code that detects, and compensates for, any modifications the end user may have done; e.g. wipe the tables and repopulate with "known good" data;
get samples of modifications from your customers so you can hand-code customised update scripts for them, without affecting their customisations; or
don't allow modifications of static data (i.e. if they customise the product by changing data they shouldn't, you say "sorry, you modified the product, we don't support you".
From your description, however, it looks like your product is designed to allow customers to customise it by changing data in these tables; in which case, your code just needs to be able to adapt to whatever changes they may have made. That needs to be a fundamental consideration in the design of the upgrade. The strategy is to enumerate all the types of changes that users may have made (or are likely to have made), and cater for them. The only viable alternative is #1 above, which removes all customisations.

Can DB2 tell a web-app when a table data is updated?

I have a table of non trivial size on a DB2 database that is updated X times a day per user input in another application. This table is also read by my web-app to display some info to another set of users. I have a large number of users on my web app and they need to do lots of fuzzy string lookups with data that is up-to-the-minute accurate. So, I need a server side cache to do my fuzzy logic on and to keep the DB from getting hammered.
So, what's the best option? I would hate to pull the entire table every minute when the data changes so rarely. I could setup a trigger to update a timestamp of a smaller table and poll that to see if I need refresh my cache, but that seems hacky to.
Ideally I would like to have DB2 tell my web-app when something changes, or at least provide a very lightweight mechanism to detect data level changes.

I think if your web application is running in WebSphere, setting up MQ would be a pretty good solution.
You could write triggers that use the MQ Series routines to add things to a queue, and your web app could subscribe to the queue and listen for updates.
If your web app is not in WebSphere then you could still look at this option but it might be more difficult.

A simple solution could be to have a timestamp (somewhere) for the latest change on to table.
The timestamp could be located in a small table/view that is updated by either the application that updates the big table or by an update-trigger on the big table.
The update-triggers only task would be to update the "help"-timestamp with currenttimestamp.
Then the webapp only checks this timestamp.
If the timestamp is newer then what the webapp has then the data is reread from the big table.
A "low-tech"-solution thats fairly non intrusive to the exsisting system.
Hope this solution fits your setup.
Regards
Sigersted

Having the database push a message to your webapp is certainly doable via a variety of mechanisms (like mqseries, etc). Similar and easier is to write a java stored procedure that gets kicked off by the trigger and hands the data to your cache-maintenance interface. But both of these solutions involve a lot of versioning dependencies, etc that could be a real PITA.
Another option might be to reconsider the entire approach. Is it possible that instead of maintaining a cache on your app's side you could perform your text searching on the original table?
But my suggestion is to do as you (and the other poster) mention - and just update a timestamp in a single-row table purposed to do this, then have your web-app poll that table. Similarly you could just push the changed rows to this small table - and have your cache-maintenance program pull from this table. Either of these is very simple to implement - and should be very reliable.

Client-server synchronization pattern / algorithm?

I have a feeling that there must be client-server synchronization patterns out there. But i totally failed to google up one.
Situation is quite simple - server is the central node, that multiple clients connect to and manipulate same data. Data can be split in atoms, in case of conflict, whatever is on server, has priority (to avoid getting user into conflict solving). Partial synchronization is preferred due to potentially large amounts of data.
Are there any patterns / good practices for such situation, or if you don't know of any - what would be your approach?
Below is how i now think to solve it:
Parallel to data, a modification journal will be held, having all transactions timestamped.
When client connects, it receives all changes since last check, in consolidated form (server goes through lists and removes additions that are followed by deletions, merges updates for each atom, etc.).
Et voila, we are up to date.
Alternative would be keeping modification date for each record, and instead of performing data deletes, just mark them as deleted.
Any thoughts?

You should look at how distributed change management works. Look at SVN, CVS and other repositories that manage deltas work.
You have several use cases.
Synchronize changes. Your change-log (or delta history) approach looks good for this. Clients send their deltas to the server; server consolidates and distributes the deltas to the clients. This is the typical case. Databases call this "transaction replication".
Client has lost synchronization. Either through a backup/restore or because of a bug. In this case, the client needs to get the current state from the server without going through the deltas. This is a copy from master to detail, deltas and performance be damned. It's a one-time thing; the client is broken; don't try to optimize this, just implement a reliable copy.
Client is suspicious. In this case, you need to compare client against server to determine if the client is up-to-date and needs any deltas.
You should follow the database (and SVN) design pattern of sequentially numbering every change. That way a client can make a trivial request ("What revision should I have?") before attempting to synchronize. And even then, the query ("All deltas since 2149") is delightfully simple for the client and server to process.

As part of the team, I did quite a lot of projects which involved data syncing, so I should be competent to answer this question.
Data syncing is quite a broad concept and there are way too much to discuss. It covers a range of different approaches with their upsides and downsides. Here is one of the possible classifications based on two perspectives: Synchronous / Asynchronous, Client/Server / Peer-to-Peer. Syncing implementation is severely dependent on these factors, data model complexity, amount of data transferred and stored, and other requirements. So in each particular case the choice should be in favor of the simplest implementation meeting the app requirements.
Based on a review of existing off-the-shelf solutions, we can delineate several major classes of syncing, different in granularity of objects subject to synchronization:
Syncing of a whole document or database is used in cloud-based applications, such as Dropbox, Google Drive or Yandex.Disk. When the user edits and saves a file, the new file version is uploaded to the cloud completely, overwriting the earlier copy. In case of a conflict, both file versions are saved so that the user can choose which version is more relevant.
Syncing of key-value pairs can be used in apps with a simple data structure, where the variables are considered to be atomic, i.e. not divided into logical components. This option is similar to syncing of whole documents, as both the value and the document can be overwritten completely. However, from a user perspective a document is a complex object composed of many parts, but a key-value pair is but a short string or a number. Therefore, in this case we can use a more simple strategy of conflict resolution, considering the value more relevant, if it has been the last to change.
Syncing of data structured as a tree or a graph is used in more sophisticated applications where the amount of data is large enough to send the database in its entirety at every update. In this case, conflicts have to be resolved at the level of individual objects, fields or relationships. We are primarily focused on this option.
So, we grabbed our knowledge into this article which I think might be very useful to everyone interested in the topic => Data Syncing in Core Data Based iOS apps (http://blog.denivip.ru/index.php/2014/04/data-syncing-in-core-data-based-ios-apps/?lang=en)

What you really need is Operational Transform (OT). This can even cater for the conflicts in many cases.
This is still an active area of research, but there are implementations of various OT algorithms around. I've been involved in such research for a number of years now, so let me know if this route interests you and I'll be happy to put you on to relevant resources.

The question is not crystal clear, but I'd look into optimistic locking if I were you.
It can be implemented with a sequence number that the server returns for each record. When a client tries to save the record back, it will include the sequence number it received from the server. If the sequence number matches what's in the database at the time when the update is received, the update is allowed and the sequence number is incremented. If the sequence numbers don't match, the update is disallowed.

I built a system like this for an app about 8 years ago, and I can share a couple ways it has evolved as the app usage has grown.
I started by logging every change (insert, update or delete) from any device into a "history" table. So if, for example, someone changes their phone number in the "contact" table, the system will edit the contact.phone field, and also add a history record with action=update, table=contact, field=phone, record=[contact ID], value=[new phone number]. Then whenever a device syncs, it downloads the history items since the last sync and applies them to its local database. This sounds like the "transaction replication" pattern described above.
One issue is keeping IDs unique when items could be created on different devices. I didn't know about UUIDs when I started this, so I used auto-incrementing IDs and wrote some convoluted code that runs on the central server to check new IDs uploaded from devices, change them to a unique ID if there's a conflict, and tell the source device to change the ID in its local database. Just changing the IDs of new records wasn't that bad, but if I create, for example, a new item in the contact table, then create a new related item in the event table, now I have foreign keys that I also need to check and update.
Eventually I learned that UUIDs could avoid this, but by then my database was getting pretty large and I was afraid a full UUID implementation would create a performance issue. So instead of using full UUIDs, I started using randomly generated, 8 character alphanumeric keys as IDs, and I left my existing code in place to handle conflicts. Somewhere between my current 8-character keys and the 36 characters of a UUID there must be a sweet spot that would eliminate conflicts without unnecessary bloat, but since I already have the conflict resolution code, it hasn't been a priority to experiment with that.
The next problem was that the history table was about 10 times larger than the entire rest of the database. This makes storage expensive, and any maintenance on the history table can be painful. Keeping that entire table allows users to roll back any previous change, but that started to feel like overkill. So I added a routine to the sync process where if the history item that a device last downloaded no longer exists in the history table, the server doesn't give it the recent history items, but instead gives it a file containing all the data for that account. Then I added a cronjob to delete history items older than 90 days. This means users can still roll back changes less than 90 days old, and if they sync at least once every 90 days, the updates will be incremental as before. But if they wait longer than 90 days, the app will replace the entire database.
That change reduced the size of the history table by almost 90%, so now maintaining the history table only makes the database twice as large instead of ten times as large. Another benefit of this system is that syncing could still work without the history table if needed -- like if I needed to do some maintenance that took it offline temporarily. Or I could offer different rollback time periods for accounts at different price points. And if there are more than 90 days of changes to download, the complete file is usually more efficient than the incremental format.
If I were starting over today, I'd skip the ID conflict checking and just aim for a key length that's sufficient to eliminate conflicts, with some kind of error checking just in case. (It looks like YouTube uses 11-character random IDs.) The history table and the combination of incremental downloads for recent updates or a full download when needed has been working well.

For delta (change) sync, you can use pubsub pattern to publish changes back to all subscribed clients, services like pusher can do this.
For database mirror, some web frameworks use a local mini database to sync server side database to local in browser database, partial synchronization is supported. Check meteror.

This page clearly describes mosts scenarios of data synchronization with patterns and example code: Data Synchronization: Patterns, Tools, & Techniques
It is the most comprehensive source I found, considering whole of delta syncs, strategies on how to handle deletions and server-to-client and client-to-server sync. It is a very good starting point, worth a look.

Separating Demo data in Live system

If we put aside the rights and wrongs of putting demo data into a live system for a minute (that's a whole separate discussion!), we are being asked to store some demo data in our live system so that it can be credibly demonstrated without the appearance of smoke + mirrors (we want to use the same login page for example)
Since I'm sure this is a challenge many other people must have - I'd be interested to know what approaches have people have devised to separating this data so that it doesn't get in the way of day to day operations on their systems?
As I alluded to above, I'm aware that this probably isn't best practice. :-)

Can you instead, segregate the data into a new database, and just redirect your connection strings (they're not hard-coded, right? right?) to point to the demo database. This way, live data isn't tainted, and your code looks identical. We actually do a three tier-deployment system this way, where we do local development, deploy to QC environments that have snapshots of the live data every few months, and then deploy to live when testing is complete.

FWIW, we're looking at using Oracle's row level security / virtual private database feature to seperate the demo data from the rest.

I've often seen it on certain types of live systems.
For example, point of sale systems in a supermarket: cashiers are trained on the production point of sale terminals.
The key is to carefully identify the test or training data. I wouldn't say that there's any explicit best practice for how to model this in a database - it's going to be applicaiton specific.
You really have to carefully define the scope of what is covered by the test/training scenarios. For example, you don't want the training/test transactions to appear in production reports (but you may want to be able to create reports with this data for training/test purposes).

Completely disagree with Joe. Oracle has a tool to do this regardless of implementation. Before I read your answer I was going to say VPD... But that could have an impact on Production.
Remember Every table in a query changes from
SELECT * FROM tableA
to
SELECT * FROM (SELECT * FROM tableA WHERE Data_quality = 'PROD' <or however you do it>
Every table with a policy that is...
So assuming your test data has to span EVERY table, every table will have to have a policy and every table will be filtered before a SQL can begin working.
You can even hide that column from the users. You'll need to write the policy with some deftness if you do. You'll have to create that value based on how the data is inserted and expose the column to certain admin accounts for maintenance.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio