Plugin to trigger only once on create / update of multiple records in entity / table x - dynamics-crm

We have an entity which holds 3 types of records: A, B, and C. Record type A is the parent of B (each A can be the parent of multiple B records), and B in turn is the parent of C (each B can be the parent of multiple C records). On creation / update of every C record, the CalculateCommercials plugin runs: it pulls all the sibling C records under the given B record, aggregates / rolls up the totals, and updates that parent B record with the rolled-up sums / totals. The same thing happens on update of the B record totals, which rolls the totals up to the grandparent A record. The problem is that when we create / update multiple C records under a given parent B, this triggers multiple CalculateCommercials plugin instances, which is inefficient and resource-intensive. Is there a better approach where CalculateCommercials triggers only once, irrespective of the number of C records being created / updated?
For example, if we create / update 10 C records at a given time, we want the CalculateCommercials plugin to run only once to roll up the totals to B, and again only once to update A, instead of 10 times (currently it triggers 10 CalculateCommercials plugin instances to roll up to B, which in turn trigger another 10 instances to update the grandparent record of type A).
Sometimes this chain of auto-triggers results in exceeding the 2-minute time limit for a plugin instance. Is there a better approach to simplify rolling the totals up to parent B and then on to A?

A lot depends on how the rollup fields are used, and on their consistency requirements.
The simplest approach is to plan for eventual consistency rather than strict consistency. The update-grandchild plugin would merely mark a "dirty" field on the grandchild entity, which requires no additional DB locking. Separately, a scheduled workflow or Power Automate flow runs every N minutes, finds all the dirty grandchildren, updates their parents' rollup fields, and resets the dirty fields. Because this workflow is scheduled and nothing else takes a write lock on the rollup fields, there is little contention on the parent and grandparent records. However, it means that the rollup values are always a few minutes out of date if the grandchildren are constantly changing.
D365 seems to implement a "rollup field" concept that does exactly this: https://www.encorebusiness.com/blog/rollup-fields-in-microsoft-dynamics-365-crm/ The scheduled task runs every hour by default, but can be configured by an admin.
To improve the update latency, you could make the flow on-demand or triggered by updates to the relevant grandchild fields, but have it check & update a "flow already triggered" table to ensure that multiple flow instances aren't trying to do the updates simultaneously. The "winning" instance of the flow sets the "running" flag, loops repeatedly until it finds no more dirty records, and then clears the "running" flag. The "losing" instances terminate immediately when they see that another instance is active.
This pattern requires a lot more care to handle potential race conditions related to records being marked dirty after the running flag is cleared, as well as transient errors that might terminate the flow before it completes. Again, a simple solution is to schedule a "sweeper" instance once per hour, or day.
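Not Dynamics-specific code, but a minimal sketch of the dirty-flag / sweeper idea against a hypothetical SQL schema (child records carry an is_dirty flag and a parent_id, parents carry a total); in Dynamics itself the same logic would live in the plugin plus the scheduled flow described above:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class RollupSweeper {
    // One pass re-aggregates every parent that has at least one dirty child and then
    // clears the flags. Because the total is recomputed from ALL children of an affected
    // parent, it does not matter whether 1 or 10 children changed since the last run.
    public void sweep(Connection con) throws SQLException {
        con.setAutoCommit(false);
        try (Statement st = con.createStatement()) {
            st.executeUpdate(
                "UPDATE parent_b SET total = " +
                "  (SELECT COALESCE(SUM(c.amount), 0) FROM child_c c WHERE c.parent_id = parent_b.id) " +
                "WHERE EXISTS (SELECT 1 FROM child_c c " +
                "              WHERE c.parent_id = parent_b.id AND c.is_dirty = 1)");
            // Rows marked dirty between these two statements are the race condition discussed
            // above; a periodic full sweep that ignores the flag covers anything missed here.
            st.executeUpdate("UPDATE child_c SET is_dirty = 0 WHERE is_dirty = 1");
            con.commit();
        } catch (SQLException e) {
            con.rollback();
            throw e;
        }
    }
}

The same pass can then be repeated one level up to roll the B totals into A.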
If you need it to be strictly consistent, then this whitepaper might help: https://download.microsoft.com/download/D/6/6/D66E61BA-3D18-49E8-B042-8434E64FAFCA/ScalableDynamicsCRMCustomizations.docx It discusses the tradeoffs inherent in various locking and registration options for plugins.
If the rollup fields are rarely consumed, then it might be more efficient to recalculate the field only when a client requests that field while reading the parent or grandparent entity, using a separate plugin on those entities. You'd want to ensure that clients don't request that field unless they need it.

Related

elasticsearch bulk ingestion how to avoid updates

Within my product I use Elasticsearch for storing CDRs (call them txn logs, if you will). My transactions are asynchronous and happen at a very fast rate, i.e. around 5000 txns/sec. Each transaction involves submitting a request to a network entity; at some later point in time I receive the response.
The data ingestion into ES earlier involved a two-phase operation: 1) add an entry into ES as soon as I submit to the network layer; 2) when I get the response, update the previous entry with additional status, such as delivery succeeded.
I am doing this with the bulk insertion method, in which the bulk records contain both inserts and updates. As a result the ingestion is very slow, which ended up hogging / halting my application. Later, we changed the ingestion technique so that we only insert into Elasticsearch when we get the final response; until then we store the data in a Redis store. But this has the disadvantages of data loss and non-realtime reports.
So I was looking at some option like having 2 indexes for the same record. The parent index would have all the data, and the child record would have the delivery status. I don't know if this is possible. I have studied nested queries and has-child / has-parent queries. What I am unsure of is: can I insert the parent and child data at separate points in time, without having to use update? Or should I create two different records with a common txn-id without worrying about parent/child?
What is the best way?

How to lock on select and release lock after update is committed using spring?

I started using Spring a few months ago and I have a question on transactions. I have a Java method inside my Spring Batch job which first does a select to get the first 100 rows with status 'NOT COMPLETED' and then updates the selected rows to change the status to 'IN PROGRESS'. Since I'm processing around 10 million records, I want to run multiple instances of my batch job, and each instance has multiple threads. For a single instance, to make sure two threads are not fetching the same set of records, I have made my method synchronized. But if I run multiple instances of my batch job (multiple JVMs), there is a high probability that the same set of records might be fetched by both instances, even if I use optimistic or pessimistic locking or "select for update", since we cannot lock records during selection. Below is an example. Transaction 1 has fetched 100 records, and meanwhile Transaction 2 also fetched 100 records; if I enable locking, Transaction 2 waits until Transaction 1 has updated and committed, but Transaction 2 then performs the same update again.
Is there any way in spring to make transaction 2's select operation to wait until transaction 1's select is completed ?
Transaction1              Transaction2
fetch 100 records
                          fetch 100 records
update 100 records
commit
                          update 100 records
                          commit
@Transactional
public synchronized List<Student> processStudentRecords() {
    // Fetch up to 100 rows whose status is 'NOT COMPLETED'
    List<Student> students = getNotCompletedRecords();
    if (students != null && !students.isEmpty()) {
        // Mark them as being worked on
        updateStatusToInProgress(students);
    }
    return students;
}
Note: I cannot perform the update first and then the select. I would appreciate it if an alternative approach could be suggested.
Transaction synchronization should be left to the database server and not managed at the application level. From the database server point of view, no matter how many JVMs (threads) you have, those are concurrent database clients asking for read/write operations. You should not bother yourself with such concerns.
What you should do though is try to minimize contention as much as possible in the design of your solution, for example, by using the (remote) partitioning technique.
if I run multiple instances of my batch job (multiple JVMs), there is a high probability that the same set of records might be fetched by both instances even if I use optimistic or pessimistic locking or "select for update", since we cannot lock records during selection
Partitioning data will by design remove all these problems. If you give each instance a set of data to work on, there is no chance that a worker would select the same set of records as another worker. Michael gave a detailed example in this answer: https://stackoverflow.com/a/54889092/5019386.
(Logical) partitioning, however, will not solve the contention problem, since all workers still read/write from/to the same table, but that's the nature of the problem you are trying to solve. What I'm saying is that you don't need to start locking/unlocking the table in your design; leave this to the database. Some database servers, like Oracle, can write data of the same table to different partitions on disk to optimize concurrent access (which might help if you use partitioning), but again that's Oracle's business, not Spring's (or any other framework's).
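For illustration only, a minimal sketch of such a partitioner with Spring Batch (the student table and id column are assumptions, made up to match the question); each worker step is handed a disjoint id range, so no two workers can ever pick up the same records:

import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Splits the id space into gridSize non-overlapping ranges, one per worker step.
public class IdRangePartitioner implements Partitioner {
    private final long minId;  // e.g. SELECT MIN(id) FROM student WHERE status = 'NOT COMPLETED'
    private final long maxId;  // e.g. SELECT MAX(id) FROM student WHERE status = 'NOT COMPLETED'

    public IdRangePartitioner(long minId, long maxId) {
        this.minId = minId;
        this.maxId = maxId;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        long rangeSize = (maxId - minId) / gridSize + 1;
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext ctx = new ExecutionContext();
            ctx.putLong("fromId", minId + i * rangeSize);
            ctx.putLong("toId", Math.min(minId + (i + 1) * rangeSize - 1, maxId));
            partitions.put("partition" + i, ctx);
        }
        return partitions;
    }
}

Each worker's reader would then use a query like SELECT * FROM student WHERE id BETWEEN :fromId AND :toId AND status = 'NOT COMPLETED', with the two bounds read from its step execution context.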
Not everybody can afford Oracle, so I would look for a solution at the conceptual level. I have successfully used the following solution ("pseudo" physical partitioning) for a problem similar to yours:
Step 1 (in serial): copy/partition unprocessed data into temporary tables.
Step 2 (in parallel): run multiple workers on these tables instead of the source table with millions of rows.
Step 3 (in serial): copy/update processed data back to the original table
Step 2 removes the contention problem. Usually, the cost of (Step 1 + Step 3) is negligible compared to Step 2 (even more so if Step 2 is done in serial). This works well if the processing is the bottleneck.
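A rough sketch of Steps 1 and 3 in plain JDBC (table names are made up, and Step 2 is assumed to run the workers against the staging tables, e.g. as a partitioned step):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class PseudoPhysicalPartitioning {
    // Step 1 (serial): copy unprocessed rows into one staging table per worker.
    public void stage(Connection con, int workers) throws SQLException {
        try (Statement st = con.createStatement()) {
            for (int i = 0; i < workers; i++) {
                st.executeUpdate("CREATE TABLE student_part_" + i + " AS " +
                        "SELECT * FROM student WHERE status = 'NOT COMPLETED' " +
                        "AND MOD(id, " + workers + ") = " + i);
            }
        }
    }

    // Step 3 (serial): copy the processed status back into the source table and clean up.
    public void merge(Connection con, int workers) throws SQLException {
        try (Statement st = con.createStatement()) {
            for (int i = 0; i < workers; i++) {
                st.executeUpdate("UPDATE student SET status = " +
                        "(SELECT p.status FROM student_part_" + i + " p WHERE p.id = student.id) " +
                        "WHERE EXISTS (SELECT 1 FROM student_part_" + i + " p WHERE p.id = student.id)");
                st.executeUpdate("DROP TABLE student_part_" + i);
            }
        }
    }
}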
Hope this helps.

Version number in event sourcing aggregate?

I am building microservices. One of my microservices uses CQRS and event sourcing. Integration events are raised in the system, and I am saving my aggregates in an event store while also updating my read model.
My question is: why do we need a version on the aggregate when we are updating the event stream for that aggregate? I read that we need this for consistency, that events are to be replayed in sequence, and that we need to check the version before saving (https://blog.leifbattermann.de/2017/04/21/12-things-you-should-know-about-event-sourcing/). I still can't get my head around this, since events are raised and saved in order, so I really need a concrete example to understand what benefit we get from versions and why we even need them.
Many thanks,
Imran
Let me describe a case where aggregate versions are useful:
In our reSolve framework, the aggregate version is used for optimistic concurrency control.
I'll explain it by example. Let's say an InventoryItem aggregate accepts the commands AddItems and OrderItems. AddItems increases the number of items in stock; OrderItems decreases it.
Suppose you have an InventoryItem aggregate #123 with one event, ITEMS_ADDED, with a quantity of 5. Aggregate #123's state says there are 5 items in stock.
So your UI shows users that there are 5 items in stock. User A decides to order 3 items, user B 4 items. Both issue OrderItems commands at almost the same time; let's say user A is first by a couple of milliseconds.
Now, if you have a single instance of aggregate #123 in memory, in a single thread, you don't have a problem: the first command from user A succeeds, the event is applied, the state says the quantity is 2, so the second command from user B fails.
In a distributed or serverless system, where the commands from A and B are handled in separate processes, both commands would succeed and bring the aggregate into an incorrect state if we don't use some form of concurrency control. There are several ways to do this: pessimistic locking, a command queue, an aggregate repository, or optimistic locking.
Optimistic locking seems to be simplest and most practical solution:
We say that every aggregate has a version: the number of events in its stream. So our aggregate #123 has version 1.
When an aggregate emits an event, the event data carries the aggregate version. In our case, the ITEMS_ORDERED events from users A and B will both have an event aggregate version of 2. Aggregate event versions must be sequentially increasing, so all we need to do is put a database constraint that the tuple {aggregateId, aggregateVersion} must be unique on writes to the event store.
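A minimal sketch of that write path, assuming a relational event store with a primary key on (aggregate_id, aggregate_version); the duplicate-key error is what surfaces as the concurrency exception in the walkthrough below:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class EventStore {
    // Assumed DDL: CREATE TABLE events (aggregate_id BIGINT, aggregate_version BIGINT,
    //   type VARCHAR(100), payload TEXT, PRIMARY KEY (aggregate_id, aggregate_version))
    public void append(Connection con, long aggregateId, long expectedVersion,
                       String type, String payload) throws SQLException {
        String sql = "INSERT INTO events (aggregate_id, aggregate_version, type, payload) " +
                     "VALUES (?, ?, ?, ?)";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setLong(1, aggregateId);
            ps.setLong(2, expectedVersion + 1);   // the version this event claims
            ps.setString(3, type);
            ps.setString(4, payload);
            ps.executeUpdate();                   // violates the key if someone else already wrote this version
        } catch (SQLException e) {
            // In real code, check e.getSQLState() for a unique-key violation (class "23");
            // that is the optimistic-concurrency conflict: another writer got there first.
            throw new IllegalStateException("stale version for aggregate " + aggregateId, e);
        }
    }
}

On that exception, the command handler reloads the aggregate from the now-longer stream and re-runs the command, which is the retry described below.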
Let's see how our example would work in a distributed system with optimistic concurrency control:
User A issues an OrderItems command for aggregate #123.
Aggregate #123 is restored from its events: {version 1, quantity 5}.
User B issues an OrderItems command for aggregate #123.
Another instance of aggregate #123 is restored from its events: {version 1, quantity 5}.
The aggregate instance for user A performs the command; it succeeds, and the event ITEMS_ORDERED {aggregateId 123, version 2} is written to the event store.
The aggregate instance for user B performs the command; it succeeds locally, but when it attempts to write its event ITEMS_ORDERED {aggregateId 123, version 2} to the event store, it fails with a concurrency exception.
On such an exception, the command handler for user B just repeats the whole procedure: aggregate #123 is now restored in the state {version 2, quantity 2}, and the command is handled correctly (in this case it is rejected, since only 2 items remain while user B wants 4).
I hope this clears the case where aggregate versions are useful.
Yes, this is right. You need the version or a sequence number for consistency.
Two things you want:
Correct ordering
Usually events are idempotent in nature, because in a distributed system idempotent messages or events are easier to deal with. Idempotent messages are ones that, even when applied multiple times, give the same result. Updating a register to a fixed value (say one) is idempotent, but incrementing a counter by one is not. In distributed systems, when A sends a message to B, B acknowledges A. But if B consumes the message and, due to some network error, the acknowledgement to A is lost, A doesn't know whether B received the message and so it sends the message again. Now B applies the message again, and if the message is not idempotent, the final state will be wrong. So you want idempotent messages. But if you fail to apply these idempotent messages in the same order as they were produced, your state will again be wrong. This ordering can be achieved using the version id or a sequence number. If your event store is an RDBMS, you cannot order your events without some such sort key. In Kafka you likewise have the offset id, and the client keeps track of the offset up to which it has consumed.
Deduplication
Secondly, what if your messages are not idempotent? Or what if your messages are idempotent but the consumer invokes some external service in a non-deterministic way? In such cases you need exactly-once semantics, because if you apply the same message twice, your state will be wrong. Here too you need the version id or sequence number. If, at the consumer end, you keep track of the version id you have already processed, you can dedupe based on that id. In Kafka, you might then want to store the offset id at the consumer end.
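A minimal sketch of the consumer-side bookkeeping for both ordering and deduplication (the names are illustrative; in Kafka the same role is played by the committed offset):

import java.util.HashMap;
import java.util.Map;

// Tracks the last event version applied per aggregate: anything at or below it is a
// duplicate delivery, anything more than one step ahead has arrived out of order.
public class VersionTracker {
    private final Map<Long, Long> lastApplied = new HashMap<>();

    public boolean shouldApply(long aggregateId, long eventVersion) {
        long last = lastApplied.getOrDefault(aggregateId, 0L);
        if (eventVersion <= last) {
            return false;                       // duplicate: skip it
        }
        if (eventVersion > last + 1) {
            throw new IllegalStateException(    // gap: wait for redelivery of the missing event
                "expected version " + (last + 1) + " but got " + eventVersion);
        }
        lastApplied.put(aggregateId, eventVersion);
        return true;                            // exactly the next version: apply it
    }
}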
Further clarifications based on comments:
The author of the article in question assumed an RDBMS as the event store. The version id or event sequence is expected to be generated by the producer. Therefore, in your example, the "delivered" event will have a higher sequence number than the "in transit" event.
The problem happens when you want to process your events in parallel. What if one consumer gets the "delivered" event and another consumer gets the "in transit" event? Clearly you have to ensure that all events of a particular order are processed by the same consumer. In Kafka, you solve this problem by choosing the order id as the partition key. Since one partition will be processed by only one consumer, you know you'll always get "in transit" before "delivered". But multiple orders will be spread across different consumers within the same consumer group, and thus you get parallel processing.
Regarding the aggregate id, I think this is analogous to a topic in Kafka. Since the author assumed an RDBMS store, he needs some identifier to segregate different categories of messages. You do that by creating separate topics in Kafka, and also consumer groups per aggregate.

How do I ensure consistency of aggregates with high availability?

My team needs to find a solution to the following problem:
Our application allows users to view total sales for the enterprise, totals by product, totals by region, totals by region x product, totals by regions x division, etc. You get the idea. There are so many values that need to be aggregated to get many of those totals that they cannot be computed on the fly - we have to pre-aggregate them to provide decent response times, a process that takes about 5 minutes.
The problem, which we thought was a common one but for which we can find no references, is how to allow updates to the various sales figures without shutting out the users. Also, the users cannot accept eventual consistency: if they drill down on a total of 12, they had better see numbers that add up to 12. So we need consistency + availability.
The best solution we've come up with so far is to direct all queries to a redundant database, "B" (optimized for queries) while updates are directed to the primary database, "A". When we decide to spend the 5 minutes to update all the aggregates, we update database "C", which is yet another redundant database just like "B". Then, new user sessions get directed to "C", while existing user sessions continue to use "B". Eventually, warning anyone left using "B", we kill the sessions on "B" and re-aggregate there, swapping the roles of "B" and "C". Typical drain-stop scenario.
We are surprised that we cannot find any discussion of this and are concerned that we are over-engineering the problem, or maybe it's not the problem we think it is. Any advice is greatly appreciated.
This was an interesting problem so I thought about it on the train, and I came up with the idea of storing a timestamp for each row in the database that you aggregate over. (I think this technique has a name, but it escapes me and googling isn't finding it...)
The timestamp would indicate when this row was inserted. In addition:
-If rows can be updated, then you will have two 'versions' of the row at once, one more recent than the other.
-If rows can be deleted, then there will need to be a 'deleted version' row that specifies when it was deleted.
Now you can do things such as:
1) Say you update the aggregates at Jan 1 2000 midnight. You can have views of the table return the table's data as though it were Jan 1 2000 midnight, ignoring all inserts/updates/deletes more recent than that (see the sketch after this list). Now the aggregates are as up to date as the data in the view AND you can keep adding data to the underlying table.
2) I don't know how feasible this would be, or how easy it would be to guarantee its reliability, but you could have 'differentially computed aggregates', where on Jan 2 2000 midnight you take the aggregates of Jan 1 2000 midnight and update them only with the data that has changed since that time, saving you from recomputing so much historical data. (Of course, it gets hairier once you consider rows being updated or deleted that are older than 24 hours.)
3) Whenever you bring your aggregates up to date, you can merge updated and deleted rows with their older versions and get rid of the older versions, so you only have to keep duplicates of rows around while you need them to separate rows that have been aggregated from rows that haven't. (This also means that, for instance, if all your aggregates run at once and you update a row three times in quick succession, you only need to keep the most recent update-indicating row.)
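As a sketch of point 1 above, with hypothetical columns (every insert/update writes a new row stamped with version_ts, and a delete writes a marker row with is_deleted = 1), an "as of" read that ignores anything newer than the last aggregation run could look like:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;

public class AsOfReader {
    // Returns the sales rows exactly as they looked at the aggregation cut-off:
    // the newest version of each row written on or before the cut-off, unless that
    // newest version is a deletion marker.
    public ResultSet readAsOf(Connection con, Timestamp cutoff) throws SQLException {
        String sql =
            "SELECT * FROM sales s " +
            "WHERE s.version_ts <= ? " +
            "  AND s.is_deleted = 0 " +
            "  AND NOT EXISTS (SELECT 1 FROM sales newer " +
            "                  WHERE newer.row_id = s.row_id " +
            "                    AND newer.version_ts > s.version_ts " +
            "                    AND newer.version_ts <= ?)";
        PreparedStatement ps = con.prepareStatement(sql);
        ps.setTimestamp(1, cutoff);
        ps.setTimestamp(2, cutoff);
        return ps.executeQuery();
    }
}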
If the aggregates cannot be computed on the fly, then caching result sets, as you are doing with another database, helps solve the availability issue with faster response times.
For consistency, you may be able to make use of some form of transaction isolation. For example, MySQL supports a number of different transaction isolation levels, of which REPEATABLE READ may come close to providing you with consistency within a single transaction. If a transaction can be left open for multiple requests as the users drill down into the data, they effectively see a snapshot of the database state as of the first request.
In a more generic sense, you're just after a handle to the data, provided by the client, that indicates a consistent set. As in Patashu's answer, the handle for a client requesting a set of aggregates could be time-based. The first stage of client interaction would be to get a handle to the latest aggregate data, e.g. the current time. It would then pass that handle with each request. As requests are made to the server, it uses the handle to determine which set of aggregate data to return. Rather than having both server "B" and server "C", all aggregate data could be stored in server "B", with all aggregate data containing the handle information. This then allows requests to a single server for aggregate data both new and old. At some point, old aggregate data could be purged from "B".
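A small sketch of the handle idea, assuming a hypothetical schema in which every pre-aggregated row is stamped with the id of the aggregation run that produced it:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class AggregateReader {
    // First request of a session: the newest completed aggregation run becomes the handle.
    public long latestHandle(Connection con) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                 "SELECT MAX(run_id) FROM aggregation_runs WHERE completed = 1");
             ResultSet rs = ps.executeQuery()) {
            rs.next();
            return rs.getLong(1);
        }
    }

    // Every drill-down passes the same handle, so all the numbers a user sees come from
    // one consistent run, even while a newer run is being loaded alongside it.
    public ResultSet totalsByRegion(Connection con, long handle) throws SQLException {
        PreparedStatement ps = con.prepareStatement(
            "SELECT region, total FROM sales_totals WHERE run_id = ?");
        ps.setLong(1, handle);
        return ps.executeQuery();
    }
}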
Perhaps a search on transaction isolation will turn up more results for discussion on consistency.
I think you're looking for Data Warehousing concepts
In computing, a data warehouse or enterprise data warehouse (DW, DWH, or EDW) is a database used for reporting and data analysis. It is a central repository of data which is created by integrating data from one or more disparate sources. Data warehouses store current as well as historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons.
...
Unlike the ETL-based data warehouse, the integrated source data systems and the data warehouse are all integrated since there is no transformation of dimensional or reference data. This integrated data warehouse architecture supports the drill down from the aggregate data of the data warehouse to the transactional data of the integrated source data systems.

Is there any way to update the most recent version of row through SQL?

I read that Oracle maintains row versions to deal with concurrency. I want to run an update query on a very big real-time database but this update job must alter the most recent version of the row.
Is this possible via PL/SQL or simply SQL?
Edit:
Let me clarify the scenario with the real-life issue that we faced on a very large database. Our client is a well-known cell phone service provider.
Our database has a table that holds the current balance on each customer's cell phone account. Among its other columns, one column stores the amount of the recharge done and another holds the current active balance.
We have two independent PL/SQL scripts. One script fires automatically when a customer recharges his phone and updates his balance.
The second script deducts certain charges from the customers' accounts. This is a batch job, as it applies to all customers. The script is scheduled to run at certain intervals during the day. When it runs, it loads 50,000 records into memory, updates certain columns, and performs a bulk update back to the table.
The issue happened like this:
A customer with ID 101 went to his local shop to get his phone recharged and paid the amount. But before his recharge was applied, the second script fired at its scheduled time and loaded the records of 50,000 customers into memory, including this customer's record.
Before the second script's batch update finished, the first script successfully recharged the customer's account.
So in the actual table the CurrentAccountBalance column got updated to 150, but the in-memory records the second script was working on still had the customer's old balance, i.e. 100.
The second script had to deduct 10 from CurrentAccountBalance. The customer's balance should have ended up at 140, but this issue made it 90.
How do we deal with this issue?
I think what you want is what happens anyway when you UPDATE.
It is true that Oracle keeps old data around for a while, but only to support consistent reads, that is, read operations that see the state as it was at the start of the transaction, even if the data was overwritten in the meantime. This is called Multi-Version Concurrency Control and can be controlled by the transaction isolation level.
You can explicitly request the most recent version by selecting FOR UPDATE; that takes a lock on the record so that nobody else can update it in the meantime (until your transaction ends).
However, if you need to write anything (e.g. UPDATE), Oracle always works on the most recent version.
As @Markus suggested, you have a race condition. If you're loading records into memory, working on them, and then updating the rows in the table, and something else may update them in the meantime, then you need to lock them while you work on them. (I'm assuming whatever you're doing is too complicated to express as a simple one-step update.) Something like this would work:
DECLARE
  CURSOR c IS SELECT * FROM current_balance_table FOR UPDATE;
  new_value current_balance_table.CurrentAccountBalance%TYPE;
BEGIN
  FOR r IN c LOOP
    /* Do whatever calculations you need */
    new_value := r.CurrentAccountBalance - 10;
    UPDATE current_balance_table
       SET CurrentAccountBalance = new_value
     WHERE CURRENT OF c;
  END LOOP;
END;
The problem now is that all the records are locked for the duration of the loop, so your customer in the shop will either not be able to update their balance or will have a long wait before the update takes effect, though when it does it will work on the updated value you stored. So you'd have to break the cursor up into smaller chunks, balancing the performance of your script against the impact on anyone else trying to update the same table.
One option would be to have an outer cursor selecting all the customers you're targeting, with no locking, and then an inner one that locks the balance record for each customer while that row is calculated and updated. You'd have to commit after each iteration of the inner loop to release the lock for that row. This involves a lot more locking/unlocking, and committing after every row update slows things down a lot, but it minimises the impact on the individual customer in the shop, since only a single row is locked at a time and the length of time it is locked is minimised. So you need to find the right balance.
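Not PL/SQL, but the same outer/inner idea sketched in JDBC just to show the locking granularity: the id list is read with no locks, then each balance row is locked, updated, and committed individually.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class PerRowDeduction {
    public void deductCharges(Connection con) throws SQLException {
        // Outer pass: collect the target customer ids without taking any locks.
        List<Long> ids = new ArrayList<>();
        try (PreparedStatement ps = con.prepareStatement(
                 "SELECT customer_id FROM current_balance_table");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                ids.add(rs.getLong(1));
            }
        }

        // Inner pass: lock, recalculate and update one balance row at a time, committing
        // after each row so any given customer is only ever blocked very briefly.
        con.setAutoCommit(false);
        try (PreparedStatement lock = con.prepareStatement(
                 "SELECT CurrentAccountBalance FROM current_balance_table " +
                 "WHERE customer_id = ? FOR UPDATE");
             PreparedStatement upd = con.prepareStatement(
                 "UPDATE current_balance_table SET CurrentAccountBalance = ? " +
                 "WHERE customer_id = ?")) {
            for (long id : ids) {
                lock.setLong(1, id);
                try (ResultSet rs = lock.executeQuery()) {
                    if (!rs.next()) {
                        continue;                             // row gone; nothing to update
                    }
                    double newBalance = rs.getDouble(1) - 10; // whatever the real deduction is
                    upd.setDouble(1, newBalance);
                    upd.setLong(2, id);
                    upd.executeUpdate();
                }
                con.commit();                                 // releases the lock on this row
            }
        }
    }
}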
