Oracle v19: can ongoing transactions block concurrent deletes on involved tables for extended periods?

We have a severe issue with threads hanging in operations against an Oracle DB (v19, accessed via JDBC connections).
The situation frequently happens while our application runs a big transaction within which it does a lot of major (i.e. quite complicated, lots of joins, etc.) queries and then updates a bunch of rows. These transactions can take several minutes.
As far as we have been able to analyze, the transaction blocks other concurrent tasks when they try to delete individual entries from tables that are involved in said transaction. Concurrent selects and also updates to these same tables work fine! It's only deletes that have issues! And, as we were able to prove, this happens even for deletes of individual entries that definitely do not interfere with or touch any entry involved in the ongoing transaction.
While we first suspected Hibernate to interfere and do funny things for deletions we had to learn that even deletes executed via SQLDeveloper (i.e. triggered "manually" by a completely unrelated DB session and client) do hang during such periods.
To us it almost seems as if an ongoing transaction does not only lock specific rows from manipulation but locks entire tables.
But can it really be that a transaction blocks entire tables from concurrent delete operations for extended periods?
We think that would be absurd but - as we had to learn and can easily reproduce - deleting entries from tables touched by our long-running transaction invariably hangs. Several times we also witnessed that - as soon as the transaction finishes - those deletes that haven't timed out yet continue and run to completion.
We are not aware of doing anything weird or unusual in our Hibernate-based application. We certainly don't fiddle with any locking mechanism or such. Any idea or hint what could cause these hangs and/or in which direction to investigate further to resolve this?
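One possible direction for the investigation: while a delete is hanging, ask Oracle which session is blocking it and which wait event the blocked session reports. The query below is a minimal sketch that reads only standard V$SESSION columns; it assumes a single-instance database and a user with SELECT privilege on V$SESSION. A wait event of 'enq: TM - contention' would point at a table-level lock, which is commonly associated with a foreign key whose child-table column has no index. Whether that applies to this schema is an assumption to verify, not a diagnosis.

```sql
-- Run from a separate session while a DELETE is hanging.
-- Lists every blocked session together with the session that blocks it.
SELECT w.sid              AS waiting_sid,
       w.event            AS wait_event,       -- e.g. 'enq: TM - contention' (table lock)
                                               -- or 'enq: TX - row lock contention' (row lock)
       w.seconds_in_wait  AS seconds_waiting,
       w.sql_id           AS waiting_sql_id,
       b.sid              AS blocking_sid,
       b.username         AS blocking_user,
       b.program          AS blocking_program
FROM   v$session w
       LEFT JOIN v$session b
              ON b.sid = w.blocking_session
WHERE  w.blocking_session IS NOT NULL;
```

If the blocking session turns out to be the long-running transaction and the wait is on a TM enqueue, the foreign keys that reference the table being deleted from would be the next thing to inspect.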
Later addition:
We are currently considering the following work-around: we add a column to these tables where we mark entries as being "to-be-deleted" (instead of actually deleting them as we do now). We then run a regular job during quiet times (e.g. nightly) which actually deletes these entries. We "only" need to make sure that no transaction is ever executed on these tables while that delete-job runs.
I really hate that approach, esp. since it will require adding another condition to many queries to exclude those "virtually deleted" entries, but we have no better idea so far.
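The work-around described above could look roughly like the following sketch. The table name app_item, the column name to_be_deleted and the view name app_item_live are hypothetical; the view is one way to avoid repeating the extra condition in every query.

```sql
-- Hypothetical table app_item: add the marker column, defaulting to "not deleted".
ALTER TABLE app_item ADD (to_be_deleted CHAR(1) DEFAULT 'N' NOT NULL);

-- Instead of deleting a row, mark it (an UPDATE, which is not blocked in the scenario described).
UPDATE app_item SET to_be_deleted = 'Y' WHERE id = :id;

-- Queries read from the view, so the filter lives in one place.
CREATE OR REPLACE VIEW app_item_live AS
  SELECT * FROM app_item WHERE to_be_deleted = 'N';

-- Nightly job, run while no long transaction touches the table.
DELETE FROM app_item WHERE to_be_deleted = 'Y';
COMMIT;
```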

Related

Can an Oracle query done after a commit return values prior to the commit when such commit is done with COMMIT_WRITE = NOWAIT?

I have a third-party Java library that at some point gets a JDBC connection, starts a transaction, does several batch updates with PreparedStatement.addBatch(), executes the batch, commits the transaction and closes the connection. Almost immediately after (within less than 10 milliseconds), the library gets another connection and queries one of the records affected by the update.
For the library to function properly, that query should return the updated record. However, in some rare cases, I can see (using P6Spy) that the query returns the record with its values from before the update (and the library fails at some later point due to unexpected data).
While trying to understand why this would happen, I found that in my database (Oracle 19c) there is a parameter COMMIT_WAIT that makes it possible for a call to commit not to block until the commit is finished, resulting in an asynchronous commit. So I used SHOW PARAMETERS to see the value of that parameter and found that COMMIT_WAIT is set to NOWAIT (also, COMMIT_LOGGING was set to BATCH).
I began to speculate if what was happening was that the call to commit() just started the operation (without waiting for it to finish), and perhaps the next query occurred while the operation was still in progress, returning the value of the record before the transaction. (The isolation level for all connections is Connection.TRANSACTION_READ_COMMITTED)
Can COMMIT_WAIT set to NOWAIT cause that kind of scenario? I read that the use of NOWAIT has a lot of risks associated with it, but mostly they refer to things like loss of durability if the database crashes.
Changing the commit behavior should not affect database consistency and should not cause wrong results to be returned.
A little background - Oracle uses REDO for durability (recovering data after an error) and uses UNDO for consistency (making sure the correct results are always returned for any point-in-time). To improve performance, there are many tricks to reduce REDO and UNDO. But changing the commit behavior doesn't reduce the amount of logical REDO and UNDO, it only delays and optimizes the REDO physical writes.
Before a commit happens, and even before your statements return, the UNDO data used for consistency has been written to memory. Changing the commit behavior won't stop the changes from making their way to the UNDO tablespace.
Per the Database Reference for COMMIT_WAIT, "Also, [the parameter] can violate the durability of ACID (Atomicity, Consistency, Isolation, Durability) transactions if the database shuts down unexpectedly." Since the manual is already talking about the "D" in ACID, I assume it would also explicitly mention if the parameter affects the "C".
On the other hand, the above statements are all just theory. It's possible that there's some UNDO optimization bug that's causing the parameter to break something. But I think that would be extremely unlikely. Oracle goes out of its way to make sure that data is never lost or incorrect. (I know because even when I don't want REDO or UNDO it's hard to turn them off.)
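One way to test the theory rather than rely on it is to force synchronous commits for the affected sessions and see whether the stale reads still occur. The statements below are a sketch using the documented COMMIT_WAIT and COMMIT_LOGGING parameters and the WRITE clause of COMMIT; how to get a third-party library to issue them (for example through a connection-pool initialization statement) is left open and is an assumption about your setup.

```sql
-- SQL*Plus / SQLcl: show the current settings.
SHOW PARAMETER commit_wait
SHOW PARAMETER commit_logging

-- Override the instance defaults for the current session only.
ALTER SESSION SET COMMIT_WAIT = 'WAIT';
ALTER SESSION SET COMMIT_LOGGING = 'IMMEDIATE';

-- Alternatively, override the behavior for a single commit.
COMMIT WRITE WAIT IMMEDIATE;
```

If the query still occasionally returns the old values with synchronous commits in place, the parameter can be ruled out and the cause is more likely elsewhere, for example in the order in which the library actually issues the commit and the query.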

Table Locking in PostgreSQL

I have a PL/pgSQL function which takes data from a staging table to our target table. The process executes every night. Sometimes due to server restart or some maintenance issues we get the process executed manually.
The problem I am facing: whenever we start the process manually after 7 AM, it takes almost 2 hours to complete (read from staging table and insert into the target table). But whenever it executes as per schedule, i.e., before 7 AM, it takes 22-25 minutes on average.
What could be the issue? If required, I can share my function snippet here.
The typical reason would be general concurrent activity in the database, which competes for the same resources as your function and may cause lock contention. Check your DB log for activities starting around 7 a.m.
The Postgres Wiki on lock monitoring
A function always runs as a single transaction. Locks are acquired along the way and only released at the end of a transaction. This makes long running functions particularly vulnerable to lock contention.
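To check whether the slow manual runs are actually waiting on locks, a query along these lines can be run from another session while the function is executing. It is a minimal sketch that assumes PostgreSQL 9.6 or later, where pg_blocking_pids() is available.

```sql
-- For every backend that is waiting on a lock, show who is holding it up.
SELECT blocked.pid                  AS blocked_pid,
       blocked.query                AS blocked_query,
       now() - blocked.query_start  AS running_for,
       blocking.pid                 AS blocking_pid,
       blocking.state               AS blocking_state,
       blocking.query               AS blocking_query
FROM   pg_stat_activity AS blocked
JOIN   pg_stat_activity AS blocking
       ON blocking.pid = ANY (pg_blocking_pids(blocked.pid));
```

An empty result during a slow run suggests the time is going into general resource competition (I/O, CPU) rather than lock waits.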
You may be able to optimize general performance as well as behavior towards concurrent transactions to make it run faster. Or more radically: if at all possible, split your big function into separate parts, which you call in separate transactions.
PostgreSQL obtain and release LOCK inside stored function
How to split huge updates:
How do I do large non-blocking updates in PostgreSQL?
There are additional things to consider when packing multiple big operations into a single function:
Execute multiple functions together without losing performance

Oracle database as a single synchronization point for two separate web applications

I am considering using an Oracle database to synchronize concurrent operations from two or more web applications on separate servers. The database is the single infrastructure element in common for those applications.
There is a good chance that two or more applications will attempt to perform the same operation at the exact same moment (cron invoked). I want to use the database to let one application decide that it will be the one which will do the work, and that the others will not do it at all.
The general idea is to perform a select/insert with the node's ID that is effectively atomic and visible to all connections. Only the node whose ID matches the first inserted node ID returned by the select would do the work.
It was suggested to me that a merge statement can be of use here. However, after doing some research, I found a discussion which states that the merge statement is not designed to be called
Another option is to lock a table. By definition, only one node will be able to lock the table and do the insert, then select. After the lock is released, other instances will see the inserted value and will not perform the work.
What other solutions would you consider? I frown on workarounds with random delays, or even using oracle exceptions to notify a node that it should not do the work. I'd prefer a clean solution.
I ended up going with SELECT FOR UPDATE. It works as intended. It is important to remember to commit the transaction as soon as the needed update is made, so that other nodes don't hang waiting for the value.
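A sketch of how such a SELECT FOR UPDATE claim might look. The table job_claim, its columns, the job name and the :node_id bind variable are hypothetical; the original post does not show its schema.

```sql
-- One-time setup: one row per scheduled job (hypothetical schema).
CREATE TABLE job_claim (
  job_name    VARCHAR2(64) PRIMARY KEY,
  last_run    DATE,
  claimed_by  VARCHAR2(64)
);
INSERT INTO job_claim (job_name) VALUES ('nightly_export');
COMMIT;

-- Each node, when cron fires:
-- 1. Lock the row. A second node blocks here until the first one commits.
SELECT last_run, claimed_by
FROM   job_claim
WHERE  job_name = 'nightly_export'
FOR UPDATE;

-- 2. If last_run is not today, this node claims the job and commits right away.
--    If last_run is already today, just COMMIT (or ROLLBACK) and skip the work.
UPDATE job_claim
SET    last_run   = TRUNC(SYSDATE),
       claimed_by = :node_id
WHERE  job_name   = 'nightly_export';
COMMIT;
```

Committing immediately after the update, before doing the actual work, keeps the lock short so that the other nodes do not hang.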

One data store. Multiple processes. Will this SQL prevent race conditions?

I'm trying to create a Ruby script that spawns several concurrent child processes, each of which needs to access the same data store (a queue of some type) and do something with the data. The problem is that each row of data should be processed only once, and a child process has no way of knowing whether another child process might be operating on the same data at the same instant.
I haven't picked a data store yet, but I'm leaning toward PostgreSQL simply because it's what I'm used to. I've seen the following SQL fragment suggested as a way to avoid race conditions, because the UPDATE clause supposedly locks the table row before the SELECT takes place:
UPDATE jobs
SET status = 'processed'
WHERE id = (
SELECT id FROM jobs WHERE status = 'pending' LIMIT 1
) RETURNING id, data_to_process;
But will this really work? It doesn't seem intuitive that Postgres (or any other database) could lock the table row before performing the SELECT, since the SELECT has to be executed to determine which table row needs to be locked for updating. In other words, I'm concerned that this SQL fragment won't really prevent two separate processes from selecting and operating on the same table row.
Am I being paranoid? And are there better options than traditional RDBMSs to handle concurrency situations like this?
As you said, use a queue. The standard solution for this in PostgreSQL is PgQ. It has all these concurrency problems worked out for you.
Do you really want many concurrent child processes that must operate serially on a single data store? I suggest that you create one writer process who has sole access to the database (whatever you use) and accepts requests from the other processes to do the database operations you want. Then do the appropriate queue management in that thread rather than making your database do it, and you are assured that only one process accesses the database at any time.
The situation you are describing is called "Non-repeatable read". There are two ways to solve this.
The preferred way would be to set the transaction isolation level to at least REPEATABLE READ. This means that concurrent updates of the nature you described will fail: if two processes update the same rows in overlapping transactions, one of them will be canceled, its changes discarded, and it will return an error. That transaction will have to be retried. This is achieved by calling
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ
at the start of the transaction. I can't seem to find documentation that explains an idiomatic way of doing this for Ruby; you may have to emit that SQL explicitly.
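Put together with the statement from the question, one attempt might look like the sketch below. The retry itself has to live in the calling code: when PostgreSQL cancels the transaction, it reports SQLSTATE 40001 (serialization_failure).

```sql
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- Claim one pending job. If another transaction updated the same row
-- concurrently, this statement fails with SQLSTATE 40001.
UPDATE jobs
SET    status = 'processed'
WHERE  id = (
    SELECT id FROM jobs WHERE status = 'pending' LIMIT 1
)
RETURNING id, data_to_process;

COMMIT;
-- On SQLSTATE 40001: issue ROLLBACK and run the whole transaction again.
```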
The other option is to manage the locking of tables explicitly, which can cause a transaction to block (and possibly deadlock) until the table is free. Transactions won't fail in the same way as they do above, but contention will be much higher, and so I won't describe the details.
That's pretty close to the approach I took when I wrote pg_message_queue, which is a simple queue implementation for PostgreSQL. Unlike PgQ, it requires no components outside of PostgreSQL to use.
It will work just fine. MVCC will come to the rescue.
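A different, commonly used variant of the statement from the question (not what any of the answers above propose) adds row locking to the subquery, so that concurrent workers skip rows another worker has already picked instead of waiting on them. This sketch assumes PostgreSQL 9.5 or later, where SKIP LOCKED is available, and reuses the jobs table from the question.

```sql
UPDATE jobs
SET    status = 'processed'
WHERE  id = (
    SELECT id
    FROM   jobs
    WHERE  status = 'pending'
    ORDER  BY id
    LIMIT  1
    FOR UPDATE SKIP LOCKED   -- lock the chosen row; skip rows locked by other workers
)
RETURNING id, data_to_process;
```

A worker that gets no row back simply found nothing left to claim at that moment.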

Can I substitute savepoints for starting new transactions in Oracle?

Right now the process that we're using for inserting sets of records is something like this:
(and note that "set of records" means something like a person's record along with their addresses, phone numbers, or any other joined tables).
1. Start a transaction.
2. Insert a set of records that are related.
3. Commit if everything was successful, roll back otherwise.
4. Go back to step 1 for the next set of records.
Should we be doing something more like this?
1. Start a transaction at the beginning of the script.
2. Start a savepoint for each set of records.
3. Insert a set of related records.
4. Roll back to the savepoint if there is an error, go on if everything is successful.
5. Commit the transaction at the end of the script.
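The second process could be sketched in PL/SQL roughly as follows. The tables staging_person, staging_address, person and address and their columns are hypothetical stand-ins for "a set of related records".

```sql
BEGIN
  FOR rec IN (SELECT person_id, name FROM staging_person) LOOP
    -- Re-issuing SAVEPOINT with the same name moves it to the current point.
    SAVEPOINT before_person;
    BEGIN
      INSERT INTO person (person_id, name)
      VALUES (rec.person_id, rec.name);

      INSERT INTO address (person_id, line1)
      SELECT person_id, line1
      FROM   staging_address
      WHERE  person_id = rec.person_id;
    EXCEPTION
      WHEN OTHERS THEN
        -- Undo only this person's rows; earlier sets stay in the transaction.
        ROLLBACK TO SAVEPOINT before_person;
    END;
  END LOOP;

  COMMIT;  -- single commit at the end of the script
END;
/
```

Note that a real script would probably also log the failed person_id somewhere instead of silently discarding the error.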
After having some issues with ORA-01555 and reading a few Ask Tom articles (like this one), I'm thinking about trying out the second process. Of course, as Tom points out, starting a new transaction is something that should be defined by business needs. Is the second process worth trying out, or is it a bad idea?
A transaction should be a meaningful Unit Of Work. But what constitutes a Unit Of Work depends upon context. In an OLTP system a Unit Of Work would be a single Person, along with their address information, etc. But it sounds as if you are implementing some form of batch processing, which is loading lots of Persons.
If you are having problems with ORA-1555 it is almost certainly because you have a long-running query supplying data which is being updated by other transactions. Committing inside your loop contributes to the cyclical use of UNDO segments, and so will tend to increase the likelihood that the segments you are relying on to provide read consistency will have been reused. So, not doing that is probably a good idea.
Whether using SAVEPOINTs is the solution is a different matter. I'm not sure what advantage that would give you in your situation. As you are working with Oracle10g perhaps you should consider using bulk DML error logging instead.
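For illustration, DML error logging on a hypothetical person table loaded from a hypothetical staging_person table might look like the sketch below. It relies on DBMS_ERRLOG and the LOG ERRORS clause, which are available from Oracle 10g Release 2; the tag 'nightly load' is an arbitrary label.

```sql
-- One-time setup: creates the error table ERR$_PERSON for table PERSON.
BEGIN
  DBMS_ERRLOG.CREATE_ERROR_LOG(dml_table_name => 'PERSON');
END;
/

-- Load everything in one statement; rows that fail are written to the
-- error table instead of aborting the whole insert.
INSERT INTO person (person_id, name)
SELECT person_id, name
FROM   staging_person
LOG ERRORS INTO err$_person ('nightly load') REJECT LIMIT UNLIMITED;

COMMIT;

-- Inspect what was rejected and why.
SELECT ora_err_number$, ora_err_mesg$, person_id
FROM   err$_person
WHERE  ora_err_tag$ = 'nightly load';
```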
Alternatively you might wish to rewrite the driving query so that it works with smaller chunks of data. Without knowing more about the specifics of your process I can't give specific advice. But in general, instead of opening one cursor for 10000 records it might be better to open it twenty times for 500 rows a pop. The other thing to consider is whether the insertion process can be made more efficient, say by using bulk collection and FORALL.
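A sketch of the bulk collection and FORALL idea, again with hypothetical staging_person and person tables. It fetches 500 rows at a time from a single cursor; it does not by itself reopen the driving query in smaller chunks, which would need a restartable key range in the query.

```sql
DECLARE
  CURSOR c IS
    SELECT person_id, name FROM staging_person;

  TYPE id_tab   IS TABLE OF staging_person.person_id%TYPE;
  TYPE name_tab IS TABLE OF staging_person.name%TYPE;

  ids   id_tab;
  names name_tab;
BEGIN
  OPEN c;
  LOOP
    -- Fetch up to 500 rows per round trip instead of one row at a time.
    FETCH c BULK COLLECT INTO ids, names LIMIT 500;
    EXIT WHEN ids.COUNT = 0;

    -- One bulk-bound INSERT per batch.
    FORALL i IN 1 .. ids.COUNT
      INSERT INTO person (person_id, name)
      VALUES (ids(i), names(i));
  END LOOP;
  CLOSE c;

  COMMIT;
END;
/
```

Separate collections per column are used here because older releases restrict referencing record fields inside FORALL.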
Some thoughts...
Seems to me one of the points of the asktom link was to size your rollback/undo appropriately to avoid the 1555's. Is there some reason this is not possible? As he points out, it's far cheaper to buy disk than it is to write/maintain code to handle getting around rollback limitations (although I had to do a double-take after reading the $250 pricetag for a 36Gb drive - that thread started in 2002! Good illustration of Moore's Law!)
This link (Burleson) shows one possible issue with savepoints.
Is your transaction in actuality steps 2,3, and 5 in your second scenario? If so, that's what I'd do - commit each transaction. Sounds a bit to me like scenario 1 is a collection of transactions rolled into one?

Resources