I am repeating my unanswered question about the MERGE operation. As I understand it, to do a MERGE, MD-SAL has to read a tree into memory. Is that right? Can this read data be used for another MERGE within the same transaction or within the same transaction chain? If I need to do lots of merges of subtrees, will it improve performance if, within the same transaction chain, I first read the entire tree that includes all these subtrees?
The tree is already fully stored in memory, so it does not need to be read per transaction. When a transaction is created, it merely takes a snapshot of the current tree, which is very fast. Subsequent transaction operations, including reads, work on that snapshot. When a transaction is committed, the updates in the snapshot are validated and applied to the in-memory tree.
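To picture what that means for your merges, here is a toy sketch of the behaviour (plain Java, not the MD-SAL API; all names are invented): every merge and read within a transaction works against the one snapshot taken at creation, so pre-reading the tree yourself buys nothing.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the snapshot behaviour described above; not MD-SAL code. */
public class SnapshotStore {
    private Map<String, String> tree = new HashMap<>();   // the shared in-memory "tree"

    /** Creating a transaction just captures the current state of the tree. */
    public Transaction newTransaction() {
        return new Transaction(new HashMap<>(tree));
    }

    public class Transaction {
        private final Map<String, String> snapshot;        // taken once, at creation

        private Transaction(Map<String, String> snapshot) { this.snapshot = snapshot; }

        public void merge(String path, String data) { snapshot.put(path, data); }  // works on the snapshot
        public String read(String path)             { return snapshot.get(path); } // so do reads
        public void commit()                        { tree = snapshot; }           // validate + apply (checks omitted)
    }

    public static void main(String[] args) {
        SnapshotStore store = new SnapshotStore();
        Transaction tx = store.newTransaction();
        tx.merge("/a/b", "v1");
        tx.merge("/a/c", "v2");              // the second merge reuses the same snapshot; nothing is re-read
        System.out.println(tx.read("/a/b")); // v1, served from the snapshot
        tx.commit();                         // changes become visible to transactions created afterwards
    }
}
```

So several merges within one transaction (or one transaction chain) already share the same snapshot; a preliminary read of the whole tree will not make them faster.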
My back-end (Java) relies heavily on tree structures with strong inheritance. Conflict resolution is complex, so I am looking to test a way to simply block users when the propagation of changes in higher nodes has not yet reached the current element.
Hierarchies are represented through both Materialized Paths and Adjacency Lists for performance reasons. The goal would be to:
Prevent updates (bad request) when the API requests a change to a node with pending propagation
Inform the user through the DTO (e.g. an isLocked attribute) when they retrieve a node with pending propagation
Propagation is a simple matter of going through all nodes in a top-down fashion. It used to run level by level (which would have been easier), but it is no longer orchestrated: each node sends the message to its children.
At the moment I have two ideas I do not like:
Add a locked flag on each node (persisted in DB), toggle it to true for all descendants of a modified node, then each node can be unlocked after being processed.
Leverage the materialized path and record the current unprocessed node in a new table. If node D with path A.B.C.D is queried, the presence of any of the 4 path nodes in that table means the node has not been processed yet and should be locked.
I do not like approach 1 because it needs to update all entities twice, although retrieving the list would be quick with the Materialized Path.
I do not like approach 2 because:
The materialized path is stored as a VARCHAR2, so the comparison cannot be done in the DB; I would first have to unwrap the path in application code to get all nodes in the path and then query the DB to check for any of the elements in the hierarchy (roughly the check sketched after this list).
Trees can be quite large, with hundreds of children per node and tens of thousands of nodes per tree. Modifying the root would create a huge number of those temporary records holding the current 'fringe' of the propagation. That many independent DB calls is not ideal, especially since nodes can often be processed in less than 10 ms. I'd probably hit a bottleneck and poor performance quite quickly.
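For reference, the per-node check that approach 2 implies would look roughly like the sketch below; `pendingNodeIds` stands in for a hypothetical lookup against the new table (all names are placeholders):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

public class PropagationLockCheck {

    /** True if the node itself or any ancestor on its materialized path is still awaiting propagation. */
    static boolean isLocked(String materializedPath, Set<String> pendingNodeIds) {
        // "A.B.C.D" -> [A, B, C, D]; use whatever delimiter your path column uses
        List<String> pathNodes = Arrays.asList(materializedPath.split("\\."));
        return pathNodes.stream().anyMatch(pendingNodeIds::contains);
    }

    public static void main(String[] args) {
        Set<String> pending = Set.of("B");                 // propagation currently parked at B
        System.out.println(isLocked("A.B.C.D", pending));  // true  -> reject the update / set isLocked on the DTO
        System.out.println(isLocked("A.X.Y", pending));    // false -> the node may be modified
    }
}
```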
Is there another approach that could be taken to identify whether a propagation has reached a node? Examples, comparisons, ... Anything that could help decide on the best way to approach this problem.
Let's say I have a function that carries out a lot of CRUD operations, and also assume that this function is going to execute without any exception (100% success). Is it better to have one transaction for the entire function, or a transaction commit for each CRUD operation? Basically, I want to know whether using many transaction commits has an impact on memory and time consumption while executing a function that has a lot of CRUD operations.
Transaction boundaries should be defined by your business logic.
If your application has 100 CRUD operations to do, and each is completely independent of the others, maybe a commit after each is appropriate. Think about this: is it OK for a user running a report against your database to see only half of the CRUD operations?
A transaction is a set of updates that must all happen together or not at all, because a partial transaction would represent an inconsistent or inaccurate state.
Commit at the end of every transaction - that's it. No more, no less. It's not about performance, releasing locks, or managing server resources. Those are all real technical issues, but you don't solve them by committing halfway through a logical unit of work. Commit frequency is not a valid "tuning trick".
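To make that concrete, here is a minimal JDBC sketch of one commit per logical unit of work; the order header/lines schema is invented purely for illustration:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class OrderWriter {

    /** The header and all of its lines become visible together, or not at all. */
    public static void saveOrder(Connection conn, long orderId, long[] itemIds) throws SQLException {
        conn.setAutoCommit(false);                       // take control of the transaction boundary
        try (PreparedStatement header = conn.prepareStatement(
                 "INSERT INTO orders (id) VALUES (?)");
             PreparedStatement line = conn.prepareStatement(
                 "INSERT INTO order_lines (order_id, item_id) VALUES (?, ?)")) {
            header.setLong(1, orderId);
            header.executeUpdate();
            for (long itemId : itemIds) {
                line.setLong(1, orderId);
                line.setLong(2, itemId);
                line.executeUpdate();
            }
            conn.commit();                               // end of the logical unit of work
        } catch (SQLException e) {
            conn.rollback();                             // a partial order never becomes visible
            throw e;
        }
    }
}
```

Committing after the header but before the lines would expose exactly the half-done state described above.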
EDIT
To answer your actual question:
Basically, I want to know whether using many transaction commits has an impact on memory and time consumption while executing a function that has a lot of CRUD operations.
Committing frequently will actually slow you down. Every time you do a regular commit, Oracle has to make sure that anything in the redo log buffers is flushed to disk, and your COMMIT will wait for that process to complete.
Also, there is little or no memory saving in frequent commits. Almost all of your transaction's work, and any held locks, are recorded in redo log buffers and/or database block buffers in memory. Oracle flushes both of those to disk in the background as often as it needs to in order to manage memory. Yes, that's right: your dirty, uncommitted database blocks can be written to disk. No commit necessary.
The only resource that a really huge transaction can blow out is UNDO space. But, again, you don't fix that problem by committing halfway through a logical unit of work. If your logical unit of work is really that huge, size your database with an appropriate amount of UNDO space.
My response is "it depends." Does the transaction involve data in only one table or several? Are you performing inserts, updates, or deletes? With an INSERT, no other session can see your data until it is committed, so technically there is no rush. However, if you update a row in a table where the exact same row may need to be updated by another session in short order, you do not want to hold the row for any longer than absolutely necessary. What constitutes a logical unit of work, how much UNDO the table and index changes consume, and concurrent DML demand for the same rows all come into play when choosing the commit frequency.
Two-phase commits are supposed to suffer from blocking problems. Is that the case with CockroachDB, and if not, how is it avoided?
Summary: 2-phase commits are blocking, so it is important to keep the thing being 2-phase committed as "small" as possible, so that the set of actions that get blocked is minimal. CockroachDB does this using MVCC with write intents, 2-phase committing only a single intent. Because CockroachDB provides serializable transactions, it reorders transaction timestamps so that blocking happens only where it is absolutely necessary.
Longer answer
2-phase commits are blocking after the first phase, while all participants wait for a reply from the coordinator as to whether the second phase is to be committed or aborted. During this period, participants that have already sent a "Yes" vote cannot unilaterally revoke their vote, but also cannot treat the transaction as committed (the coordinator might come back with an abort). So they are forced to block all subsequent actions that need to know concretely what the state of this transaction is. The key word in that sentence is "need": it is on us to design the system so that this set is reduced to the bare minimum. CockroachDB uses write intents and MVCC to minimize these dependencies.
Consider a naïve implementation of a distributed (multi-key) transactional key-value store: I wish to transactionally commit some write transaction t1. t1 spans many keys across many machines, but of particular concern is that it writes k1 = v2. k1 is on machine m1 (let's say k1=v1 was the previous value).
Since t1 spans many keys on many machines, all of them are involved in a 2-phase commit transaction. Once that 2-phase transaction is begun, we have to note that we have an intent to write k1=v2, and the status of the transaction is unknown (the transaction may abort, because one of the other writes cannot proceed).
Now if some other transaction t2 comes along which wants to read the value of k1, we simply cannot give that transaction an authoritative answer, until we know the final result of the 2-phase commit. t2 is blocked.
But, we (and CockroachDB) can do better. We can keep multiple versions of values for each key, and have a concurrency control mechanism to keep all of these versions in order. Namely, we can assign our transactions timestamps, and have our writes look (loosely) as follows:
`k1 = v1 committed at time=1`
`k1 = v2 at time=110 INTENT (pending transaction t1)`
Now, when t2 comes along, it has an option: it can choose to do its read at time <= 109, which would not be blocked on t1. Of course, some transactions cannot do this (if, say, they are also distributed, and a different component simply requires a higher timestamp). Those transactions will be blocked. But in practice, this frees the database to assign timestamps such that many types of transactions can proceed.
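To illustrate (a toy model, not CockroachDB's actual code): a key holds its committed versions plus at most one pending write intent, and a read below the intent's timestamp proceeds while a read at or above it has to wait for the intent's transaction to resolve.

```java
import java.util.Optional;
import java.util.TreeMap;

public class MvccKey {
    private final TreeMap<Long, String> committed = new TreeMap<>(); // timestamp -> committed value
    private Long intentTimestamp;                                    // pending 2PC write, if any

    void commitVersion(long ts, String value) { committed.put(ts, value); }
    void placeIntent(long ts)                 { intentTimestamp = ts; }

    /** Value visible at readTs, or empty when the read must block on the pending intent. */
    Optional<String> read(long readTs) {
        if (intentTimestamp != null && intentTimestamp <= readTs) {
            return Optional.empty();                     // outcome of the pending transaction is needed -> blocked
        }
        var visible = committed.floorEntry(readTs);      // newest committed version at or below readTs
        return visible == null ? Optional.empty() : Optional.of(visible.getValue());
    }

    public static void main(String[] args) {
        MvccKey k1 = new MvccKey();
        k1.commitVersion(1, "v1");        // k1 = v1 committed at time=1
        k1.placeIntent(110);              // k1 = v2 intended at time=110, transaction t1 still pending

        System.out.println(k1.read(109)); // Optional[v1] -> t2 reading below the intent is not blocked
        System.out.println(k1.read(120)); // Optional.empty -> a read above the intent must wait for t1
    }
}
```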
As the other answer says, Cockroach Labs has a post about CockroachDB's use of MVCC here, which explains some further details as well.
CockroachDB has a long blog post on how it uses 2-phase commit without locking here: https://www.cockroachlabs.com/blog/how-cockroachdb-distributes-atomic-transactions/
The part that deals most with the prevention of locking is its use of "write intents" (Stage: Write Intents is the heading in the blog post).
I am working on a big DB-driven application that sometimes needs a huge data import. Data is imported from Excel spreadsheets, and at the start of the process (for about 500 rows) the data is processed relatively quickly, but later it slows down significantly. The import generates 6 linked entities per row of the spreadsheet, which are flushed after processing every line. My guess is that all those entities are getting cached by Doctrine and just build up. My idea is to clear out all that cache every 200 rows, but I could not find how to clear it from within the code (the console is not an option at this stage). Any assistance or links would be much appreciated.
I suppose the cause may lie not in Doctrine but in the database transaction log buffer size. The documentation says:
A large log buffer enables large transactions to run without a need to write the log to disk before the transactions commit. Thus, if you have big transactions, making the log buffer larger saves disk I/O.
Most likely you insert your data in one big transaction. When the buffer is full, it is written to disk which is normally slower.
There are several possible solutions.
Increase buffer size so that the transaction fits into the buffer.
Split the transaction into several parts that fit into the buffer.
In the second case, keep in mind that each transaction has its own overhead, so wrapping every single insert in a separate transaction will also hurt performance.
I recommend wrapping about 500 rows in a transaction, because that seems to be a size that fits into the buffer.
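For illustration, the chunked variant looks roughly like this at the SQL level (a plain JDBC sketch with placeholder table and column names; with Doctrine the same idea means committing the import in chunks instead of in one big transaction):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class ChunkedImport {
    private static final int CHUNK_SIZE = 500;           // roughly what fits the log buffer

    public static void importRows(Connection conn, List<String[]> rows) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement insert = conn.prepareStatement(
                "INSERT INTO import_target (col_a, col_b) VALUES (?, ?)")) {
            int inChunk = 0;
            for (String[] row : rows) {
                insert.setString(1, row[0]);
                insert.setString(2, row[1]);
                insert.executeUpdate();
                if (++inChunk == CHUNK_SIZE) {
                    conn.commit();                        // keep each transaction small enough for the buffer
                    inChunk = 0;
                }
            }
            conn.commit();                                // commit the final, partial chunk
        }
    }
}
```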
I'm no DBA, I just want to learn about Oracle's Multi-Version Concurrency model.
When launching a DML operation, the first step in the MVCC protocol is to bind an undo segment. The question is: why can one undo segment only serve one active transaction?
Thank you for your time.
Multi-Version Concurrency is probably the most important concept to grasp when it comes to Oracle. It is good for programmers to understand it even if they don't want to become DBAs.
There are a few aspects to this, but they all come down to efficiency: undo management is overhead, so minimizing the number of cycles devoted to it contributes to the overall performance of the database.
A transaction can consist of many statements and generate a lot of undo: it might insert a single row, or it might delete thirty thousand. It is better to assign one empty UNDO block at the start rather than continually scout around for partially filled blocks with enough space.
Following on from that, sharing undo blocks would require the kernel to track usage at a much finer granularity, which is just added complexity.
When the transaction completes, the undo is released (unless it is still needed for read consistency; see the next point). The fewer blocks the transaction has used, the fewer latches have to be reset. Plus, if the blocks were shared, we would have to free shards of a block, which is just more effort.
The key thing about MVCC is read consistency. This means that all the records returned by a long-running query will appear in the state they had when the query started. So if I issue a SELECT on the EMP table that takes fifteen minutes to run, and halfway through you commit an update of all the salaries, I won't see your change. The database does this by retrieving the undo data from the blocks your transaction used. Again, this is a lot easier when all the undo data is collocated in one or two blocks.
"why one undo segment can only serve for one active transaction?"
It is simply a design decision; that is how undo segments are designed to work. I would guess it was done to address some of the issues that could occur with the previous rollback mechanism.
Rollback (which is still available but deprecated in favor of undo) included explicit creation of rollback segments by the DBA, and multiple transactions could be assigned to a single rollback segment. This had some drawbacks, most obviously that if one transaction assigned to a given segment generated enough rollback data that the segment was full (and could no longer extend), then other transactions using the same segment would be unable to perform any operation that would generate rollback data.
I'm surmising that one design goal of the new undo feature was to prevent this sort of inter-transaction dependency. Therefore, they designed the mechanism so that the DBA sizes and creates the undo tablespace, but the management of segments within it is done internally by Oracle. This allows the use of dedicated segments by each transaction. They can still cause problems for each other if the tablespace fills up (and cannot autoextend), but at the segment level there is no possibility of one transaction causing problems for another.