Spring JPA performance the smart way - spring-boot

I have a service that listens to multiple queues and saves the data to a database.
One queue gives me a person.
Now if I code it really simply, I just take one message from the queue at a time.
I do the following (roughly the sketch below):
Start transaction
Select from person table to check if it exists.
Either update existing or create a new entity
repository.save(entity)
End transaction
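The per-message version looks roughly like this (PersonRepository, PersonMessage, and applyUpdate are simplified placeholders, not my real code):

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class PersonMessageHandler {

    private final PersonRepository personRepository;   // plain Spring Data JPA repository

    public PersonMessageHandler(PersonRepository personRepository) {
        this.personRepository = personRepository;
    }

    @Transactional
    public void handle(PersonMessage msg) {
        // select to check if it exists, otherwise create a new entity
        Person person = personRepository.findById(msg.getId())
                .orElseGet(() -> new Person(msg.getId()));
        person.applyUpdate(msg);   // copy the fields from the message
        personRepository.save(person);
    }
}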
The above is clean and robust, but I get a lot of messages and it's not fast enough.
To improve performance I have done this instead (see the sketch after the steps).
Fetch 100 messages from the queue, then:
Start transaction
Select all persons where id in (...) in one query, using the ids from the incoming persons
Iterate over the messages and for each one check whether it was selected above. If yes, update it; if not, create a new entity
Save all changes with batch update/create
End transaction
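In code the batched version looks roughly like this (again a simplified sketch; PersonRepository is assumed to extend JpaRepository, and the message/entity types are placeholders):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class PersonBatchHandler {

    private final PersonRepository personRepository;   // extends JpaRepository<Person, Long>

    public PersonBatchHandler(PersonRepository personRepository) {
        this.personRepository = personRepository;
    }

    @Transactional
    public void handleBatch(List<PersonMessage> messages) {
        List<Long> ids = messages.stream().map(PersonMessage::getId).toList();

        // one SELECT ... WHERE id IN (...) for the whole batch
        Map<Long, Person> existing = personRepository.findAllById(ids).stream()
                .collect(Collectors.toMap(Person::getId, Function.identity()));

        List<Person> changed = new ArrayList<>();
        for (PersonMessage msg : messages) {
            // update the row we selected, or create a new entity if it was not there
            Person person = existing.getOrDefault(msg.getId(), new Person(msg.getId()));
            person.applyUpdate(msg);
            changed.add(person);
        }

        // sent as JDBC batches when spring.jpa.properties.hibernate.jdbc.batch_size is set
        personRepository.saveAll(changed);
    }
}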
If it's a simple message, the above is really good; it performs. But if the message, or the logic I have to run when I receive it, is complicated, then the above is not so good: there is a chance that some of the messages will result in a rollback, and the code becomes hard to read.
Any ideas on how to make it run fast in a smarter way?

Why do you need to rollback? Can't you just not execute whatever it is that then has to be rolled back?
IMO the smartest solution would be to code this with a single "upsert" statement. Not sure which database you use, but PostgreSQL, for example, has the ON CONFLICT clause for inserts that can be used to do updates if the row already exists. You could even configure Hibernate to use that on insert via the @SQLInsert annotation.
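For example, a rough sketch assuming PostgreSQL, a Person entity keyed by the incoming id, and Spring Boot 3 / Hibernate 6 (use javax.persistence imports on older versions). The column order in the custom SQL has to match the order in which Hibernate binds the properties, so check the SQL it logs:

import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import org.hibernate.annotations.SQLInsert;

@Entity
// Replace the generated INSERT with an upsert: saving a new entity then never
// conflicts with an existing row, it just updates that row in place.
@SQLInsert(sql = "insert into person (name, id) values (?, ?) "
        + "on conflict (id) do update set name = excluded.name")
public class Person {

    @Id
    private Long id;        // id comes from the incoming message, not a sequence

    private String name;

    protected Person() {}   // for JPA

    public Person(Long id) {
        this.id = id;
    }

    // getters/setters omitted
}

One thing to watch out for: with a manually assigned id, Spring Data's save() will merge (select first) rather than insert unless the entity implements Persistable and reports itself as new, so check which statements Hibernate actually issues.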

Related

How best to handle Constraint Violations in Spring JDBC

What is considered the best/most correct way to handle a Constraint Violation in Spring JDBC?
As a real example. I've got a users table that contains, amongst other things, columns of url_slug and email. Both of these are UNIQUE. When creating a new record, or updating an existing record, if I try to make one record contain a duplicate value of another record in this table I want to return a sensible error back to the caller.
The only options I can think of are both flawed. I'm doing this in Postgres 10, using Spring NamedParameterJdbcTemplate in case that matters at all.
Check the data before doing the INSERT/UPDATE.
This will involve an extra query on every Insert/Update call, and has a race condition that means it might still not catch the problem, e.g.:
1. Thread 1 starts transaction
2. Thread 2 starts transaction
3. Thread 1 queries data
4. Thread 2 queries data
5. Thread 1 does update
6. Thread 2 does update
7. Thread 1 does commit
8. Thread 2 does commit <-- Constraint Violation, even though at #4 the data was fine
Handle the DuplicateKeyException.
The problem here is that it's not thrown until the Transaction is committed, at which point it might well be unclear exactly which SQL call failed, which constraint failed, or anything else like that.
There is no "best" way to handle these kinds of exceptions, other than putting the call into a try-catch block and propagating the error message back to the user.
Of course, in your example the problem is that you most probably don't want to use the SERIALIZABLE isolation level, which essentially executes every transaction one by one and makes sure this cannot happen. Another way would be to lock the table for the entire transaction, but I wouldn't advise that either.
Simply put your transactional call into a try-catch block and handle it as you want.
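A minimal sketch of that with NamedParameterJdbcTemplate (the constraint names users_url_slug_key and users_email_key are illustrative; use whatever names Postgres actually reports in the error):

import org.springframework.dao.DuplicateKeyException;
import org.springframework.jdbc.core.namedparam.MapSqlParameterSource;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

public class UserDao {

    private final NamedParameterJdbcTemplate jdbc;

    public UserDao(NamedParameterJdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    public void createUser(String urlSlug, String email) {
        try {
            jdbc.update(
                    "INSERT INTO users (url_slug, email) VALUES (:urlSlug, :email)",
                    new MapSqlParameterSource()
                            .addValue("urlSlug", urlSlug)
                            .addValue("email", email));
        } catch (DuplicateKeyException e) {
            // Postgres names the violated constraint in the error message,
            // so you can map it to something sensible for the caller.
            String cause = String.valueOf(e.getMostSpecificCause().getMessage());
            if (cause.contains("users_email_key")) {
                throw new IllegalArgumentException("That email address is already taken", e);
            }
            if (cause.contains("users_url_slug_key")) {
                throw new IllegalArgumentException("That URL slug is already taken", e);
            }
            throw e;
        }
    }
}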

Using Hibernate / Spring, what's the best way to watch a table for changes to individual records?

Q: What is the proper way to watch a table for record level changes using Hibernate / Spring? The DB is a typical relational database system. Our intent is to move to an in-memory solution some time in the future but we can't do it just yet. Q: Are we on the right track or is there a better approach? Examples?
We've thought of two possibilities. One is to load and cache the whole table and the other is to implement a hibernate event listener. Problem is that we aren't interested in events originating in the current VM. What we are interested in is if someone else changes the table. If we load and cache the entire table we'll still have to figure out an efficient way to know when it changes so we may end up implementing both a cache and a listener. Of course a listener might not help us if it doesn't hear changes external to the VM. Our interest is in individual records which is to say that if a record changes, we want Java to update something else based on that record. Ideally we want to avoid re-loading the entire cache, assuming we use one, from scratch and instead update specific records in the cache as they change.

Best way to deal with multiple transactions to the database with Entity Framework

A bit of advice really: I am building an MVC application that takes in feeds for products from multiple sources. This can run into millions of records, and despite my best advice to the client to split all his feeds into smaller chunks, I know they will probably try to do a thousand at a go.
Now the main problem is that I don't want to loop through every XML record and do an insert for each one.
What I would rather do is queue up a stack of inserts and then send them to the database in one massive transaction, very much like a SQL import of a whole table.
Is this possible? If so, how, and what is it called?
Also, if I did want to re-insert repeated products again and again when nothing has changed, what would be the best practice for this? Could I maybe loop through an already fetched dataset?
I'm not sure what is best to do here, so I'm asking: what is the consensus when it comes to a scenario like this?
thanks
With Entity Framework you will get a single DB insert per record you are inserting; there will be no bulk insert (if that is what you were looking for).
However, to enclose this in a transaction, you need to do nothing but add your items to the context class.
http://msdn.microsoft.com/en-us/library/bb336792.aspx
This will automatically be wrapped in a transaction when you call SaveChanges. All you need to do is ensure you use a single context instance and call .Add(yourObject) on the context.
So just wait to call SaveChanges until all of the objects have been added to the context.

One data store. Multiple processes. Will this SQL prevent race conditions?

I'm trying to create a Ruby script that spawns several concurrent child processes, each of which needs to access the same data store (a queue of some type) and do something with the data. The problem is that each row of data should be processed only once, and a child process has no way of knowing whether another child process might be operating on the same data at the same instant.
I haven't picked a data store yet, but I'm leaning toward PostgreSQL simply because it's what I'm used to. I've seen the following SQL fragment suggested as a way to avoid race conditions, because the UPDATE clause supposedly locks the table row before the SELECT takes place:
UPDATE jobs
SET status = 'processed'
WHERE id = (
SELECT id FROM jobs WHERE status = 'pending' LIMIT 1
) RETURNING id, data_to_process;
But will this really work? It doesn't seem intuitive that Postgres (or any other database) could lock the table row before performing the SELECT, since the SELECT has to be executed to determine which row needs to be locked for updating. In other words, I'm concerned that this SQL fragment won't really prevent two separate processes from selecting and operating on the same table row.
Am I being paranoid? And are there better options than traditional RDBMSs to handle concurrency situations like this?
As you said, use a queue. The standard solution for this in PostgreSQL is PgQ. It has all these concurrency problems worked out for you.
Do you really want many concurrent child processes that must operate serially on a single data store? I suggest that you create one writer process that has sole access to the database (whatever you use) and accepts requests from the other processes to do the database operations you want. Then do the appropriate queue management in that thread rather than making your database do it, and you are assured that only one process accesses the database at any time.
The situation you are describing is called "Non-repeatable read". There are two ways to solve this.
The preferred way would be to set the transaction isolation level to at least REPEATABLE READ. This means that concurrent updates of the nature you described will fail: if two processes update the same rows in overlapping transactions, one of them will be cancelled, its changes ignored, and an error returned. That transaction will have to be retried. This is achieved by calling
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ
at the start of the transaction. I can't seem to find documentation that explains an idiomatic way of doing this in Ruby; you may have to emit that SQL explicitly.
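The question is about Ruby, but the pattern is the same from any client; purely as an illustration, a plain JDBC sketch might look like this (the jobs table is the one from the question, and 40001 is Postgres's serialization_failure SQLSTATE):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class JobClaimer {

    // Claims one pending job, retrying if a concurrent worker updated the same row first.
    public static String claimNextJob(String jdbcUrl) throws SQLException {
        while (true) {
            try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
                conn.setAutoCommit(false);
                conn.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery(
                             "UPDATE jobs SET status = 'processed' WHERE id = ("
                             + " SELECT id FROM jobs WHERE status = 'pending' LIMIT 1"
                             + ") RETURNING id, data_to_process")) {
                    String data = rs.next() ? rs.getString("data_to_process") : null;
                    conn.commit();
                    return data;   // null means the queue is empty
                } catch (SQLException e) {
                    conn.rollback();
                    // 40001 = serialization_failure: another worker won the race, try again
                    if (!"40001".equals(e.getSQLState())) {
                        throw e;
                    }
                }
            }
        }
    }
}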
The other option is to manage the locking of tables explicitly, which can cause a transaction to block (and possibly deadlock) until the table is free. Transactions won't fail in the same way as they do above, but contention will be much higher, and so I won't describe the details.
That's pretty close to the approach I took when I wrote pg_message_queue, which is a simple queue implementation for PostgreSQL. Unlike PgQ, it requires no components outside of PostgreSQL to use.
It will work just fine. MVCC will come to the rescue.

Can I substitute savepoints for starting new transactions in Oracle?

Right now the process that we're using for inserting sets of records is something like this:
(and note that "set of records" means something like a person's record along with their addresses, phone numbers, or any other joined tables).
Start a transaction.
Insert a set of records that are related.
Commit if everything was successful, roll back otherwise.
Go back to step 1 for the next set of records.
Should we be doing something more like this?
Start a transaction at the beginning of the script
Start a save point for each set of records.
Insert a set of related records.
Roll back to the savepoint if there is an error, go on if everything is successful.
Commit the transaction at the end of the script.
After having some issues with ORA-01555 and reading a few Ask Tom articles (like this one), I'm thinking about trying out the second process. Of course, as Tom points out, starting a new transaction is something that should be defined by business needs. Is the second process worth trying out, or is it a bad idea?
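Purely to illustrate the idea, the second process expressed as a JDBC sketch would look something like this (insertRecordSet and RecordSet are placeholders for whatever inserts a person plus their related rows; our actual load doesn't necessarily run through Java):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Savepoint;
import java.util.List;

public class RecordSetLoader {

    // One transaction for the whole run; one savepoint per set of related records.
    public static void load(String jdbcUrl, List<RecordSet> recordSets) throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
            conn.setAutoCommit(false);                   // start the single transaction
            for (RecordSet recordSet : recordSets) {
                Savepoint savepoint = conn.setSavepoint();
                try {
                    insertRecordSet(conn, recordSet);    // person + addresses + phone numbers
                } catch (SQLException e) {
                    conn.rollback(savepoint);            // discard only this set, keep the rest
                    // log/report the bad set and carry on
                }
            }
            conn.commit();                               // commit once at the end of the script
        }
    }

    private static void insertRecordSet(Connection conn, RecordSet recordSet) throws SQLException {
        // the actual INSERTs for the related tables go here
    }

    // placeholder for a person plus their joined rows
    static final class RecordSet {}
}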
A transaction should be a meaningful Unit Of Work. But what constitutes a Unit Of Work depends upon context. In an OLTP system a Unit Of Work would be a single Person, along with their address information, etc. But it sounds as if you are implementing some form of batch processing, which is loading lots of Persons.
If you are having problems with ORA-01555 it is almost certainly because you have a long-running query supplying data which is being updated by other transactions. Committing inside your loop contributes to the cyclical use of UNDO segments, and so will tend to increase the likelihood that the segments you are relying on to provide read consistency will have been reused. So not doing that is probably a good idea.
Whether using SAVEPOINTs is the solution is a different matter. I'm not sure what advantage that would give you in your situation. As you are working with Oracle 10g, perhaps you should consider using bulk DML error logging instead.
Alternatively you might wish to rewrite the driving query so that it works with smaller chunks of data. Without knowing more about the specifics of your process I can't give specific advice. But in general, instead of opening one cursor for 10000 records it might be better to open it twenty times for 500 rows a pop. The other thing to consider is whether the insertion process can be made more efficient, say by using bulk collection and FORALL.
Some thoughts...
Seems to me one of the points of the asktom link was to size your rollback/undo appropriately to avoid the 1555's. Is there some reason this is not possible? As he points out, it's far cheaper to buy disk than it is to write/maintain code to handle getting around rollback limitations (although I had to do a double-take after reading the $250 pricetag for a 36Gb drive - that thread started in 2002! Good illustration of Moore's Law!)
This link (Burleson) shows one possible issue with savepoints.
Is your transaction in actuality steps 2, 3, and 5 in your second scenario? If so, that's what I'd do - commit each transaction. Sounds a bit to me like scenario 1 is a collection of transactions rolled into one?
