The scenario is this:
We have two applications, A and B, both of which run in separate database (Oracle 9i) transactions.
Application A - inserts some data into the database, then calls Application B
Application B - inserts some data into the database, related (via foreign keys) to A's data. Returns an "ID" to Application A
Application A - uses ID to insert further data, including the ID from B
Now, because these are separate transactions that both rely on data from each other's transactions, we need to commit between the calls to each application. This of course makes it very difficult to roll back if anything goes wrong.
How would you approach this problem with minimal refactoring of the code? Surely this kind of thing is a common problem in the SOA world?
------ Update --------
I have not been able to find anything in Oracle 9i; however, Oracle 11g provides DBMS_XA, which does exactly what I was after.
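For anyone curious, here is a minimal sketch of what the 11g DBMS_XA calls look like when two sessions contribute work to one global transaction. The table names and the global transaction id (123) are purely illustrative, and the return codes should really be checked against DBMS_XA.XA_OK:

-- session used by application A
DECLARE
  rc PLS_INTEGER;
BEGIN
  rc := DBMS_XA.XA_START(DBMS_XA_XID(123), DBMS_XA.TMNOFLAGS);
  INSERT INTO app_a_data (id, note) VALUES (1, 'written by A');
  rc := DBMS_XA.XA_END(DBMS_XA_XID(123), DBMS_XA.TMSUSPEND);  -- suspend so B can join
END;
/
-- session used by application B, resuming the same global transaction
DECLARE
  rc PLS_INTEGER;
BEGIN
  rc := DBMS_XA.XA_START(DBMS_XA_XID(123), DBMS_XA.TMRESUME);
  INSERT INTO app_b_data (id, a_id, note) VALUES (10, 1, 'written by B');
  rc := DBMS_XA.XA_END(DBMS_XA_XID(123), DBMS_XA.TMSUCCESS);
  rc := DBMS_XA.XA_COMMIT(DBMS_XA_XID(123), TRUE);  -- both inserts commit or roll back together
END;
/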
You have three options:
Redesign the application so that you don't have two different processes (both with database connections) writing to the database and roll it into a single app.
Create application C that handles all the database transactions for A and B.
Roll your own two-phase commit, as sketched below. Application C acts as the coordinator. C signals A and B to ask if they're ready to commit. A and B do their processing, and respond to C with either a "ready" or a "fail" reply (note that there should be a timeout on C to avoid an infinite wait if one process hangs or dies). If both reply ready, then C tells them to commit. Otherwise it sends a rollback signal.
Note that you may run into issues with option 3 if app A is relying on foreign keys from app B (which you didn't state, so this may not be an issue). Oracle's read consistency would probably prevent this from being allowed, since app A's transaction will begin before app B's. Just a warning.
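For option 3, the coordination state can be as simple as a small votes table that C inspects before sending the commit or rollback signal. A rough sketch, with made-up table and column names:

CREATE TABLE txn_votes (
  txn_id      NUMBER       NOT NULL,
  participant VARCHAR2(10) NOT NULL,  -- 'A' or 'B'
  vote        VARCHAR2(10) NOT NULL,  -- 'READY' or 'FAIL'
  voted_at    DATE DEFAULT SYSDATE,
  CONSTRAINT txn_votes_pk PRIMARY KEY (txn_id, participant)
);

-- A and B each record their vote when asked to prepare
INSERT INTO txn_votes (txn_id, participant, vote) VALUES (1001, 'A', 'READY');

-- C's decision for transaction 1001: two READY votes => tell A and B to commit;
-- anything else once the timeout expires => tell them to roll back
SELECT COUNT(*) AS ready_votes
  FROM txn_votes
 WHERE txn_id = 1001
   AND vote   = 'READY';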
A few suggestions:
Use compensating transactions. Basically, you make it possible to undo the transaction you did earlier. The hard part is figuring out which transactions to roll back.
Commit the data of applications A and B to the database using a flag indicating that it is only temporary. Then, after everything checks out fine, modify the flag to indicate that the data is final. During the night, run a batch job to flush out data that has not been finalized.
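A quick sketch of that second suggestion, assuming a status column is added to the affected tables (all names here are made up):

-- A and B insert their rows with a provisional status
INSERT INTO orders (order_id, customer_id, created_at, status)
VALUES (1001, 42, SYSDATE, 'PENDING');

-- once the whole A/B exchange has checked out, finalize that unit of work
UPDATE orders SET status = 'FINAL' WHERE order_id = 1001;

-- nightly batch job: flush whatever was never finalized
DELETE FROM orders
 WHERE status = 'PENDING'
   AND created_at < SYSDATE - 1;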
You could probably insert the data from Application A into a 'temporary' area so that Application B can do the inserts of both A and B without changing much in either application. It's not particularly elegant but it might do the trick.
In another scenario you could add a 'confirmation' flag field to your data which is updated after the entire process has run successfully. If it fails at one point, it might be easier to track down the records you need to roll back (in effect, delete).
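A sketch of the 'temporary area' idea, assuming a staging table that mirrors A's target table (the names are illustrative): A parks its rows there, and B promotes A's rows together with its own inserts in a single transaction.

-- Application A writes to the staging table instead of the real one
INSERT INTO orders_staging (order_id, customer_id, payload)
VALUES (1001, 42, 'data from A');

-- Application B, in its own transaction, promotes A's rows and adds its own
INSERT INTO orders (order_id, customer_id, payload)
SELECT order_id, customer_id, payload
  FROM orders_staging
 WHERE order_id = 1001;

INSERT INTO order_details (order_id, detail_id, payload)
VALUES (1001, 5001, 'data from B');

DELETE FROM orders_staging WHERE order_id = 1001;

COMMIT;  -- one transaction now covers both applications' data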
I like both solutions presented, so I avoided posting this for a while. But you could also make an update to the main table, having saved the state of the affected rows in some cache beforehand.
This could be combined with the two-tier approach (the traffic-cop system Zathrus proposed), because it really wouldn't be needed for neonski's solution of using a "sketchpad" table or tables. The drawback is that your procs/logic would have to consult the main table from the work area or the work area from the main table, or perhaps store your flag in the main table and reset it when you commit the data to the main table.
A lady on our team is designing something like that for our realtime system, using permanent work tables.
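A sketch of that "save the state first" idea, assuming a permanent work table that holds a copy of each row before it is touched (again, the names are made up):

-- save the current state of the rows about to be modified
INSERT INTO orders_undo (order_id, customer_id, status, saved_at)
SELECT order_id, customer_id, status, SYSDATE
  FROM orders
 WHERE order_id = 1001;

-- update the main table
UPDATE orders SET status = 'SHIPPED' WHERE order_id = 1001;

-- if a later step fails, restore from the saved copy
UPDATE orders o
   SET o.status = (SELECT u.status FROM orders_undo u WHERE u.order_id = o.order_id)
 WHERE o.order_id = 1001;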
App_A =={0}=> database # App_A stores information for App_B
App_A ------> App_B # App_A starts App_B
App_B <={0}== database # App_B retrieves the information
App_B =={1}=> database # App_B stores more information
App_A <={2}== App_B # App_B returns 'ID' to App_A
App_A =={2,3}=> database # App_A stores 'ID' and additional data
Is it just me, or does it seem like Application B is essentially just a subroutine of A? I mean, Application B doesn't do anything until A asks it, and Application A doesn't do anything until B returns an ID, which means it makes little sense to have them in different applications, or even separate threads.
Related
Let's start with the background. I have an API endpoint that I have to query every 15 minutes and that returns complex data. Unfortunately this endpoint does not provide information about what exactly changed, so it requires me to compare everything against the data that I have in the db and then execute an update, add or delete. This is pretty boring...
I came to an idea that I can simply remove all data from certain tables and build everything from scratch... But I also have to return this cached data to my clients, so there might be a situation where the db is empty during a client request because it is "refreshing/rebuilding". And that can't happen, because I have to return something.
So I came to the idea to either:
Lock certain db tables so that the client has to wait while the db is being refreshed,
or
CQRS https://martinfowler.com/bliki/CQRS.html
Do you have any suggestions how to solve the problem?
It sounds like you're using a relational database, so I'll try to outline a solution using database terms. The idea, however, is more general than that. In general, it's similar to Blue-Green deployment.
Have two data tables (or two databases, for that matter); one is active, and one is inactive.
When the software starts the update process, it can wipe the inactive table and write new data into it. During this process, the system keeps serving data from the active table.
Once the data update is entirely done, the system can begin to serve data from the previously inactive table. In other words, the inactive table becomes the active table, and vice versa.
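A minimal sketch of the swap in SQL, assuming a PostgreSQL-style database where DDL is transactional, two physical tables items_a and items_b, and a view that clients always query (all names are made up):

-- clients always read through this view; initially it points at items_a
CREATE VIEW items AS SELECT * FROM items_a;

-- refresh cycle: rebuild the inactive copy while items_a keeps serving reads
BEGIN;
TRUNCATE items_b;
INSERT INTO items_b (id, name, price)
VALUES (1, 'widget', 9.99);   -- in reality, the rows rebuilt from the API response
COMMIT;

-- atomic switch: items_b becomes the active table, items_a the inactive one
BEGIN;
CREATE OR REPLACE VIEW items AS SELECT * FROM items_b;
COMMIT;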
I am considering using an Oracle database to synchronize concurrent operations from two or more web applications on separate servers. The database is the single infrastructure element in common for those applications.
There is a good chance that two or more applications will attempt to perform the same operation at the exact same moment (cron invoked). I want to use the database to let one application decide that it will be the one which will do the work, and that the others will not do it at all.
The general idea is to perform a somehow-atomic select/insert, visible to all connections, with the node's ID. Only the node whose ID matches the first inserted node ID returned by the select would do the work.
It was suggested to me that a merge statement could be of use here. However, after doing some research, I found a discussion which states that the merge statement is not designed to be called concurrently.
Another option is to lock a table. By definition, only one node will be able to lock the table and do the insert, then select. After the lock is released, other instances will see the inserted value and will not perform the work.
What other solutions would you consider? I frown on workarounds with random delays, or even using Oracle exceptions to notify a node that it should not do the work. I'd prefer a clean solution.
I ended up going with SELECT FOR UPDATE. It works as intended. It is important to remember to commit the transaction as soon as the needed update is made, so that other nodes don't hang waiting for the value.
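For reference, a minimal sketch of that approach, assuming a one-row lock table seeded in advance (the names and the 'node-1' identifier are illustrative):

-- seeded once:  INSERT INTO cron_lock (job_name) VALUES ('refresh_job');
SELECT owner_node, last_run
  FROM cron_lock
 WHERE job_name = 'refresh_job'
   FOR UPDATE;                       -- the first node gets the row lock; the others block here

-- if last_run already equals the current slot, another node has claimed it: commit and skip
-- otherwise, claim the run for this node
UPDATE cron_lock
   SET owner_node = 'node-1',
       last_run   = TRUNC(SYSDATE, 'MI')
 WHERE job_name = 'refresh_job';

COMMIT;                              -- commit immediately so the waiting nodes are released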
I'm trying to create a Ruby script that spawns several concurrent child processes, each of which needs to access the same data store (a queue of some type) and do something with the data. The problem is that each row of data should be processed only once, and a child process has no way of knowing whether another child process might be operating on the same data at the same instant.
I haven't picked a data store yet, but I'm leaning toward PostgreSQL simply because it's what I'm used to. I've seen the following SQL fragment suggested as a way to avoid race conditions, because the UPDATE clause supposedly locks the table row before the SELECT takes place:
UPDATE jobs
SET status = 'processed'
WHERE id = (
SELECT id FROM jobs WHERE status = 'pending' LIMIT 1
) RETURNING id, data_to_process;
But will this really work? It doesn't seem intuitive that Postgres (or any other database) could lock the table row before performing the SELECT, since the SELECT has to be executed to determine which table row needs to be locked for updating. In other words, I'm concerned that this SQL fragment won't really prevent two separate processes from selecting and operating on the same table row.
Am I being paranoid? And are there better options than traditional RDBMSs to handle concurrency situations like this?
As you said, use a queue. The standard solution for this in PostgreSQL is PgQ. It has all these concurrency problems worked out for you.
Do you really want many concurrent child processes that must operate serially on a single data store? I suggest that you create one writer process who has sole access to the database (whatever you use) and accepts requests from the other processes to do the database operations you want. Then do the appropriate queue management in that thread rather than making your database do it, and you are assured that only one process accesses the database at any time.
The situation you are describing is called "Non-repeatable read". There are two ways to solve this.
The preferred way would be to set the transaction isolation level to at least REPEATABLE READ. This means that concurrent updates of the nature you described will fail: if two processes update the same rows in overlapping transactions, one of them will be cancelled, its changes discarded, and it will return an error. That transaction will have to be retried. This is achieved by calling
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ
at the start of the transaction. I can't seem to find documentation that explains an idiomatic way of doing this for Ruby; you may have to emit that SQL explicitly.
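A minimal sketch in plain SQL (psql syntax); the retry on a serialization failure would live in the Ruby code:

BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
UPDATE jobs SET status = 'processed' WHERE id = 42;
-- if a concurrent transaction already committed a change to this row, the UPDATE fails
-- with "could not serialize access due to concurrent update" and must be retried
COMMIT;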
The other option is to manage the locking of tables explicitly, which can cause a transaction to block (and possibly deadlock) until the table is free. Transactions won't fail in the same way as they do above, but contention will be much higher, and so I won't describe the details.
That's pretty close to the approach I took when I wrote pg_message_queue, which is a simple queue implementation for PostgreSQL. Unlike PgQ, it requires no components outside of PostgreSQL to use.
It will work just fine. MVCC will come to the rescue.
I have two PL/SQL systems, residing in two separate databases. SystemA will need to populate SystemB's tables. This will probably be done over a database link. Every time a set of records is inserted into SystemB's tables, a process in SystemB must run. I could wait for SystemA to complete and then run a script to start processing in SystemB, but since SystemA could spend many hours processing and then populating SystemB, I'd rather that SystemB handle each set of records as soon as they become available (each set can be processed independently of the others, so this should work OK).
What I'm not sure of is how I can do event-driven programming in PL/SQL. I'd need SystemA to notify SystemB that a set is ready for processing. My first idea was to have a special "event" table in SystemB; when SystemA finishes a set, it inserts into the "event" table, and a trigger on insert starts the process (which could be a long one, possibly 5-10 minutes per process) in SystemB. I don't have enough experience with triggers in Oracle to know if this is an established way of doing it, or if there's a better mechanism. Suggestions? Tips? Advice?
Use Oracle Advanced Queuing; it's designed for this. I believe you'll still have to set up a database link between the two systems (from B to A in this case, to consume the queue on A).
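A rough sketch of the enqueue side, assuming a simple object type as the payload (names are illustrative and return-code/error handling is omitted); SystemB would consume the messages with a matching DBMS_AQ.DEQUEUE call or a notification callback:

-- one-time setup of the payload type and the queue
CREATE TYPE set_ready_msg AS OBJECT (set_id NUMBER);
/
BEGIN
  DBMS_AQADM.CREATE_QUEUE_TABLE(queue_table        => 'set_ready_qt',
                                queue_payload_type => 'SET_READY_MSG');
  DBMS_AQADM.CREATE_QUEUE(queue_name => 'set_ready_q', queue_table => 'set_ready_qt');
  DBMS_AQADM.START_QUEUE(queue_name => 'set_ready_q');
END;
/
-- SystemA enqueues a message whenever a set of records is ready
DECLARE
  enq_opts  DBMS_AQ.ENQUEUE_OPTIONS_T;
  msg_props DBMS_AQ.MESSAGE_PROPERTIES_T;
  msg_id    RAW(16);
BEGIN
  DBMS_AQ.ENQUEUE(queue_name         => 'set_ready_q',
                  enqueue_options    => enq_opts,
                  message_properties => msg_props,
                  payload            => set_ready_msg(42),
                  msgid              => msg_id);
  COMMIT;
END;
/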
Yes, Oracle Advanced Queuing, or even having A submit a venerable Oracle Job to B, would be a better idea.
And, if your process is going to need complete replication of the data from A to B, then you might want to look at something like an Oracle Streams process to copy over the data and then do the processing.
Oracle has two seemingly competing technologies. CDC and DCN.
What are the strengths of each?
When would you use one and not the other?
In general, you would use DCN to notify a client application that it needs to clear/update its cache. You would use CDC for ETL processing.
DCN would generally be preferable when you have an OLTP application that needs to be notified immediately about data changes in the database. Since the goal here is to minimize the number of network round-trips and the number of database hits, you'd generally want the application to use DCN for queries that are mostly static. If a large fraction of the query results changes regularly, you may be better off just refreshing the application's cache on a set frequency rather than running queries constantly to get the changed data (DCN does not contain the changed data, just the ROWID of the row(s) that changed). If the application goes down, I believe DCN allows changes to be lost.
CDC would generally be preferable when you have a DSS application that needs to periodically pull over all the data that changed in a number of tables. CDC can guarantee that the subscriber has received every change to the underlying table(s), which can be important if you are trying to replicate changes to a different database. CDC allows the subscriber to pull the changes at its convenience rather than trying to notify the subscriber that there are changes, so you'd definitely want CDC if you wanted the subscriber to process new data every hour or every day rather than in near real time. (Note: DCN also has a guaranteed delivery mode, see comments below. --Mark Harrison)
CDC seems to be much more complex to set up than DCN.
I mean, to set up DCN I wrap a select in a start and end DCN block and then write a procedure to be called with a collection of changes. That's it.
CDC requires publishers and subscribers and anyways, seems like more work.