Spring Data Query Execution Optimization: Parallel Execution of Hibernate @Query Methods in a JpaRepository

I have a Dashboard view which requires small sets of data from tables all over the database. I optimized the database queries (e.g. removed sub-queries). There are now ~20 queries which are executed one after the other and which fetch different data sets from the database. Most of the HQL queries contain GROUP BY and JOIN clauses. The result is returned to the front-end through a Spring REST interface.
How do I optimize the execution of these custom queries? My initial thought was to run the database queries in parallel. But how do I achieve that? After doing some research I found the annotation @Async, which makes it possible to run methods in parallel. But does this work with Hibernate methods? Is a new database session created for every method annotated with @Query in a JpaRepository? Does running the database queries in parallel have an effect on the overall execution time at all?
Another way to run the database calls in parallel is to split the Dashboard call into several single Ajax calls (every concern gets its own Ajax call). I didn't want to do that, because every time the dashboard is opened (or e.g. the date range is changed), another 20 Ajax calls are made to fetch the new data. And the same question remains: does running SQL queries in parallel have an effect on the execution time of the database?
I have not yet added additional indices to the database; that will definitely be the next thing I do. However, I'm interested in the performance impact of running the queries in parallel and in how to achieve this programmatically with Spring.
My project was initially generated by JHipster (Spring Boot, MariaDB, AngularJS, etc.).

First, running these SQL queries in parallel will not impact the database itself; it will only make the page load faster, so the design should focus on that.
I am posting this answer assuming that you have already made sure that you cannot combine these 20 SQLs because the data is unrelated (no joins, views, etc).
I would advise against using @Async, for two reasons.
Reason 1 - An asynchronous task is great when you want to fire off a bunch of tasks and forget about them, or when you know when all the tasks will be complete. Here you will need to "wait" for all your asynchronous tasks to complete. How long should you wait? Until the slowest query is done?
Check this sample code for @Async (from the guides at spring.io -- https://spring.io/guides/gs/async-method/):
// Wait until they are all done
while (!(page1.isDone() && page2.isDone() && page3.isDone())) {
    Thread.sleep(10); // 10-millisecond pause between each check
}
Will/should your service component wait on 20 Async DAO queries?
Reason 2 - Remember that @Async just spawns off the task as a separate thread. Since you are going to work with JPA, remember that entity managers are not thread-safe, and DAO classes will propagate transactions. Here is an example of the problems that may crop up - http://alexgaddie.blogspot.com/2011/04/spring-3-async-with-hibernate-and.html
IMHO, it is better to go ahead with multiple Ajax calls, because that will make your components cohesive. Yes, you will have 20 endpoints, but each would have a simpler DAO and simpler SQL, be easily unit testable, and the returned data structure will be easier for the AngularJS widgets to handle/parse. When the UI triggers all 20 Ajax calls, the dashboard will load individual widgets as they become ready, instead of loading all of them at the same time. This will also help you extend your design in the future by optimizing the slower-loading sections of your dashboard (maybe caching, indexing, etc.).
Bunching your DAO calls will only make the data structures more complex and unit testing harder.
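To make that concrete, here is a rough sketch of the endpoint-per-widget idea in Spring MVC; the resource, repository and DTO names (DashboardWidgetResource, OrderStatsRepository, OrderStatsDTO, RevenueRepository, RevenueDTO) are illustrative assumptions, not taken from the question:

import java.time.LocalDate;
import java.util.List;
import org.springframework.format.annotation.DateTimeFormat;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

// Sketch only: each dashboard widget gets its own small endpoint, so the browser can
// fire the Ajax calls in parallel and render each widget as soon as its data arrives.
@RestController
@RequestMapping("/api/dashboard")
public class DashboardWidgetResource {

    private final OrderStatsRepository orderStatsRepository; // hypothetical repository
    private final RevenueRepository revenueRepository;       // hypothetical repository

    public DashboardWidgetResource(OrderStatsRepository orderStatsRepository,
                                   RevenueRepository revenueRepository) {
        this.orderStatsRepository = orderStatsRepository;
        this.revenueRepository = revenueRepository;
    }

    @GetMapping("/order-stats")
    public List<OrderStatsDTO> orderStats(
            @RequestParam @DateTimeFormat(iso = DateTimeFormat.ISO.DATE) LocalDate from,
            @RequestParam @DateTimeFormat(iso = DateTimeFormat.ISO.DATE) LocalDate to) {
        return orderStatsRepository.findStatsBetween(from, to);
    }

    @GetMapping("/revenue")
    public List<RevenueDTO> revenue(
            @RequestParam @DateTimeFormat(iso = DateTimeFormat.ISO.DATE) LocalDate from,
            @RequestParam @DateTimeFormat(iso = DateTimeFormat.ISO.DATE) LocalDate to) {
        return revenueRepository.findRevenueBetween(from, to);
    }
}

Each endpoint can then get its own small unit test, and a slow widget does not block the others.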

Normally it will be much faster to execute the queries in parallel. If you are using Spring Data and do not configure anything specific, your JPA provider (Hibernate) will create a connection pool that stores connections to your database. I think by default Hibernate holds 10 connections and is therefore prepared to run 10 queries in parallel. How much faster the queries are when run in parallel depends on the database and the structure of the tables / queries.
I think that using @Async is not the best practice here. Defining 20 REST endpoints, each of which provides the result of a specific query, is a much better approach. That way you can simply create an Entity, Repository and REST endpoint class for each query; each query stays isolated and the code is less complex.
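On the repository side, one custom @Query per concern keeps each query isolated; the following is only a sketch with assumed entity and DTO names (Order, OrderStatsDTO), not the actual schema from the question:

import java.time.LocalDate;
import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

public interface OrderStatsRepository extends JpaRepository<Order, Long> {

    // One focused HQL query per widget, e.g. an aggregation with JOIN and GROUP BY,
    // projected straight into a DTO via a JPQL constructor expression.
    @Query("select new com.example.dashboard.OrderStatsDTO(c.name, count(o), sum(o.total)) "
         + "from Order o join o.customer c "
         + "where o.orderDate between :from and :to "
         + "group by c.name")
    List<OrderStatsDTO> findStatsBetween(@Param("from") LocalDate from, @Param("to") LocalDate to);
}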

Related

What's the performance penalty of long-lived DB transactions interleaved with one another?

Could anyone provide an explanation, or point me to a good source, of the impact of long-lived database transactions when other transactions are involved?
I'm having difficulty understanding the real performance impact on an application of transactions where most of the queries are reads and maybe two or three are writes, given the different isolation levels.
Mostly I would like to understand it in the situation where:
Neither the rows read nor the rows updated are involved in any other transaction.
The rows read are involved in another transaction but not the rows being updated and this other transaction is read only.
The rows read are involved in another transaction but not the rows being updated, and this other transaction is modifying some of the data being read. I understand that here it also matters whether the data is read before or after it is modified.
Both the rows read and the rows updated are involved in another transaction also modifying the data.
These questions come up in the context of an application using microservices, where all application-layer services are annotated with @Transactional using JPA and PostgreSQL and, to transform the data, they need to make network calls to other microservices within the transaction to fetch some other values.

GraphQL and nested resources would make unnecessary calls?

I read the GraphQL spec and could not find a way to avoid 1 + N * number_of_nested calls; am I missing something?
E.g. a query has a type client which has nested orders and addresses; if there are 10 clients it will do 1 call for the 10 clients + 10 calls for client.orders + 10 calls for client.addresses.
Is there a way to avoid this? Note that it is not the same as caching a UUID of something; those are all different values, and if your GraphQL server points to a database which can make joins, this would be pretty wasteful, because you could get by with 3 queries for any number of clients.
I ask because I want to integrate GraphQL with an API that can fetch nested resources efficiently, and if there were a way to analyze the whole query before resolving it, it would be nice to try to fetch some of the nested data in just one call.
Or have I got it wrong, and GraphQL is meant to be used only with microservices?
This is one of the difficulties of GraphQL's "resolver architecture". You must avoid incurring a ton of network latency by doing a lot of I/O in each resolver. Apps using a SQL DBMS will often grapple with the N + 1 problem at first. You need to use some batching and/or caching techniques to get around this.
If you are using Node.js on the server, I have two tools to recommend:
DataLoader - A database-agnostic tool for batching resolvers for each field and caching individual records.
Join Monster - A SQL-tailored tool that reads each query and your schema and compiles a SQL query for you. It leverages JOINs and DataLoader-style batching to fetch the data from your tables in a few (or a single) SQL queries.
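Independently of those Node.js tools, the underlying batching idea is simply to collect the keys requested by one level of resolvers and issue a single query for all of them. A rough Java/JPA sketch of that idea, with hypothetical Client/Order entities:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import javax.persistence.EntityManager;

// Instead of one orders query per client (the N + 1 pattern), issue a single IN query
// for all requested client ids and group the rows back per client.
public class OrderBatchLoader {

    private final EntityManager em;

    public OrderBatchLoader(EntityManager em) {
        this.em = em;
    }

    public Map<Long, List<Order>> ordersByClientIds(List<Long> clientIds) {
        List<Order> orders = em.createQuery(
                "select o from Order o where o.client.id in :ids", Order.class)
            .setParameter("ids", clientIds)
            .getResultList();
        return orders.stream().collect(Collectors.groupingBy(o -> o.getClient().getId()));
    }
}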
I take it that you're talking about using GraphQL with a SQL database backend. The standard itself is database-agnostic and doesn't care how you work out possible N+1 SELECT issues in your code. That being said, specific server-side implementations of a GraphQL server introduce many different ways of mitigating that problem:
AFAIK, the Ruby implementation is able to make use of Active Record and gems such as bullet to apply horizontal batching of the executed database calls.
The JavaScript implementation may make use of the DataLoader library, which uses a similar technique of batching series of executed promises together. You can see it in action here.
The Elixir and Python implementations have a concept of runtime information about the executed subqueries, which can be used to determine which data will be needed later in order to execute the GraphQL query, and potentially prefetch it.
The F# implementation works similarly to Elixir, but the plugin itself can perform live analysis of the execution tree to better describe which fields can potentially be used, allowing an easier split of the GraphQL domain model from the database model.
Many implementations (e.g. PostGraph) tie the underlying database model directly into the GraphQL schema. In this case the GraphQL query is often translated directly into the database query language.

Neo4j slows down after lots of inserts

I'm the owner of the Blockchain2graph project, which reads data from the Bitcoin Core REST API and inserts Blocks, Addresses and Transactions as graph objects in Neo4j.
After some imports, the process slows down until memory is full. I don't want to use CSV imports. My problem is not performance; my goal is to insert things without the application stopping because of memory (even if it takes quite a lot of time).
I'm using spring-boot-starter-data-neo4j.
In my code, I try to call session.clear() from time to time, but it doesn't seem to have an impact. After restarting Tomcat 8, things go fast again.
As your project is about mass inserts, I wouldn't use an OGM like Spring Data Neo4j for writing the data.
You don't want a session to keep your data around on the client.
Instead, use Cypher directly, sending the updates you get from the blockchain API as a batch per request; see my blog post for some examples (some of which we also use in SDN/Neo4j-OGM under the hood).
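For illustration, a rough sketch of such a batched write with the plain Neo4j Java driver (assuming the 4.x driver; the UNWIND query and the block properties are made up, not the actual Blockchain2graph model):

import java.util.List;
import java.util.Map;
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;

public class BlockBatchWriter {

    private final Driver driver = GraphDatabase.driver(
            "bolt://localhost:7687", AuthTokens.basic("neo4j", "secret"));

    // One parameterized request per batch instead of one request per node, so nothing
    // accumulates in a client-side OGM session between imports.
    public void writeBlocks(List<Map<String, Object>> blocks) {
        String cypher = "UNWIND $blocks AS block "
                      + "MERGE (b:Block {hash: block.hash}) "
                      + "SET b.height = block.height, b.time = block.time";
        try (Session session = driver.session()) {
            session.run(cypher, Map.of("blocks", blocks));
        }
    }
}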
You can still use SDN for individual entity handling (CRUD); that's what OGMs are good for in my book, reducing the boilerplate.
But for more complex read operations that have aggregation, filtering, projection and path matches I'd still use Cypher on an annotated repository method, returning rows that can be mapped to a list of DTOs.

Spring Batch Framework

I am not able to decide whether the Spring Batch framework is applicable to the requirement below. I need expert input on this.
Following is my requirement:
Read multiple Oracle tables (at least 10 tables, including both transaction and master tables), do complex calculations based on the business rules, and insert / update / delete records in the transaction tables.
I have identified the following two designs:
Design # 1:
ItemReader: Select eligible records from Key transaction table.
ItemProcessor: Fetch additional details from the DB using the key available in the record retrieved by the ItemReader. (This would require multiple DB transactions.)
Do the validation and computation and add the details to be written to DB as objects in a list.
ItemWriter: Write the details available in the objects using a CustomItemWriter (insert / update / delete operations).
With this design, we can achieve parallel processing but increase the number of DB transactions.
Design # 2:
Step # 1
ItemReader: Use a composite ItemReader (a group of ItemReaders) to read all the required tables.
ItemWriter: Save the result sets as lists of objects (one list per table) in the execution context.
Step # 2
ItemReader: Retrieve the lists of objects available in the execution context and group them into one list based on the business processing, so that the processor can process them.
ItemProcessor:
Process the chunk of Objects returned by ItemReader.
Do the validation and computation and add the details to be written to DB as objects in a list.
ItemWriter: Write the details available in the objects using a CustomItemWriter (insert / update / delete operations).
With this design, we can REDUCE the number of DB transactions, but we delay the processing until all table records are retrieved and stored in the execution context, i.e. we are not using the parallel processing provided by Spring Batch.
Please advise whether the above is feasible using Spring Batch or whether we need to use a conventional Java program.
The good news is that your problem description matches a very common use case for spring-batch. The bad news is that the problem description is too generic to allow much meaningful input about the specific design beyond the comments already provided.
Spring Batch brings facilities similar to JCL and ISPF from the mainframe world into the Java context.
Spring Batch provides a framework for organizing and managing the boundaries of your process. It is a natural fit for a lot of ETL and big-data operations, but it is not the only way to write these processes.
If your process can be broken down into discrete steps, then Spring Batch is a good choice for you.
The ItemReader should (logically) be an iterator returning a single object representing the start of one logical unit of work (LUW). The LUW objects are captured by the chunking mechanism and assembled into collections of the size you configure, and then passed to the processor. The result of the processor is then passed to the writer. In the context of an RDBMS-centric process, the commit happens at the end of the writer's operation.
What happens in each of those pieces of the step is 100% whatever you need (plain old Java). The point of the framework is to free you from that complexity and enable you to solve the problem.
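As a rough illustration of such a chunk-oriented step (Spring Batch 4-style builders; SourceRecord, TargetRecord and the reader/processor/writer beans are placeholders, not a suggested design):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class BatchStepConfig {

    // Chunk-oriented step: the reader emits one LUW at a time, Spring Batch groups them
    // into chunks of 100, and the commit happens once per chunk after the writer runs.
    @Bean
    public Step processTransactions(StepBuilderFactory stepBuilderFactory,
                                    ItemReader<SourceRecord> reader,
                                    ItemProcessor<SourceRecord, TargetRecord> processor,
                                    ItemWriter<TargetRecord> writer) {
        return stepBuilderFactory.get("processTransactions")
                .<SourceRecord, TargetRecord>chunk(100)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }
}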
From my understanding, Spring Batch has nothing to do with database batch operations (or at least the word 'batch' has a different meaning in these two contexts). Spring Batch is used to create processes with multiple steps, and it gives you the chance to restart a process if one of the process steps fails (without repeating the previously finished steps).

Is Hibernate good for batch processing? What about memory usage?

I have a daily batch process that involves selecting out a large number of records and formatting up a file to send to an external system. I also need to mark these records as sent so they are not transmitted again tomorrow.
In my naive JDBC way, I would prepare and execute a statement and then loop through the result set. As I only go forwards through the result set, there is no need for my application server to hold the whole result set in memory at one time. Groups of records can be fed across from the database server.
Now, let's say I'm using Hibernate. Won't I end up with a bunch of objects representing the whole result set in memory at once?
Hibernate also iterates over the result set, so only one row is kept in memory. This is the default - if you want it to load everything eagerly, you must tell it so.
Reasons to use Hibernate:
"Someone" was "creative" with the column names (PRXFC0315.XXFZZCC12)
The DB design is still in flux and/or you want one place where column names are mapped to Java.
You're using Hibernate anyway
You have complex queries and you're not fluent in SQL
Reasons not to use Hibernate:
The rest of your app is pure JDBC
You don't need any of the power of Hibernate
You have complex queries and you're fluent in SQL
You need a specific feature of your DB to make the SQL perform
Hibernate offers some possibilities to keep the session small.
You can use Query.scroll() or Criteria.scroll() for JDBC-like scrolling. You can use Session.evict(Object entity) to remove entities from the session. You can use a StatelessSession to suppress dirty-checking. And there are some more performance optimizations - see the Hibernate documentation.
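For example, a rough sketch of the scrolling approach for the daily export, assuming a hypothetical Record entity with a sent flag (exact API details differ between Hibernate versions):

import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.Session;
import org.hibernate.Transaction;

public class DailyExport {

    // Stream the result set forward-only and flush/clear periodically so the
    // session never holds the whole result set in memory.
    public void export(Session session) {
        Transaction tx = session.beginTransaction();
        ScrollableResults results = session
                .createQuery("from Record r where r.sent = false")
                .setFetchSize(500)
                .scroll(ScrollMode.FORWARD_ONLY);
        int count = 0;
        while (results.next()) {
            Record record = (Record) results.get(0);
            writeToFile(record);          // format the outgoing line (not shown)
            record.setSent(true);         // mark as sent so it is skipped tomorrow
            if (++count % 500 == 0) {
                session.flush();          // push pending updates to the database
                session.clear();          // detach processed entities to free memory
            }
        }
        tx.commit();
    }

    private void writeToFile(Record record) {
        // ... write the formatted record to the export file
    }
}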
Hibernate, like any ORM framework, is intended for developing and maintaining systems based on object-oriented programming principles. But most databases are relational and not object-oriented, so in any case an ORM is always a trade-off between convenient OOP programming and optimized / most effective DB access.
I wouldn't use an ORM for specific isolated tasks, but rather as an overall architectural choice for the application's persistence layer.
In my opinion I would NOT use Hibernate, since it makes your application a whole lot bigger and less maintainable, and you do not really have a chance of optimizing the generated SQL in a quick way.
Furthermore, you could use all the SQL functionality the JDBC bridge supports and are not limited to Hibernate's functionality. Another thing is that you also have the limitations that come along with each additional layer of code.
But in the end it is a philosophical question, and you should do it the way that fits your way of thinking best.
If there are possible performance issues then stick with the JDBC code.
There are a number of well-known pure SQL optimisations which would be very difficult to do in Hibernate:
Only select the columns you use! (No "select *" stuff.)
Keep the SQL as simple as possible, e.g. don't include small reference tables like currency codes in the join. Instead, load the currency table into memory and resolve currency descriptions with a program lookup.
Depending on the DBMS, minor re-ordering of the SQL WHERE predicates can have a major effect on performance.
If you are updating/inserting, only commit every 100 to 1000 updates, i.e. do not commit every unit of work, but keep a counter so you commit less often (see the sketch after this list).
Take advantage of the aggregate functions of your database. If you want totals by DEPT code, then do it in the SQL with "SUM(amount) ... GROUP BY DEPT".
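As a rough sketch of that periodic-commit idea in plain JDBC (the export_queue table and sent column are made-up names):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class SentFlagUpdater {

    // Mark records as sent in batches and commit every 1000 rows instead of per row,
    // which cuts down round trips and commit overhead considerably.
    public void markAsSent(Connection conn, List<Long> ids) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE export_queue SET sent = 1 WHERE id = ?")) {
            int count = 0;
            for (Long id : ids) {
                ps.setLong(1, id);
                ps.addBatch();
                if (++count % 1000 == 0) {
                    ps.executeBatch();   // send the pending statements to the database
                    conn.commit();       // commit less often than once per row
                }
            }
            ps.executeBatch();           // flush any remaining statements
            conn.commit();
        }
    }
}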
