Large Resultset with Spring Boot and QueryDSL

I have a Spring Boot application where I use QueryDSL for dynamic queries.
Now the results should be exported as a CSV file.
The model is an Order which contains products. The products should be included in the CSV file.
However, as there are many thousands of orders with millions of products, the data should not be loaded into memory all at once.
Unfortunately, the solutions proposed for Hibernate (ScrollableResults and streams) are not supported by QueryDSL.
How can this be achieved while still using QueryDSL (to avoid duplication of filtering logic)?

One workaround is to page through the results with offset and limit, processing one chunk at a time.
Something like:
long limit = 100;
long offset = 0;
List<MyEntity> entities;
do {
    entities = new JPAQuery<MyEntity>(em)
        .from(QMyEntity.entity)
        .limit(limit)
        .offset(offset)
        .fetch();
    // write this chunk to the CSV, then move on to the next one
    offset += limit;
} while (!entities.isEmpty());
With that approach you fetch the data in smaller chunks. It is important to check whether limit and offset work well with your query. Without a deterministic order (an orderBy on a unique column), rows can be skipped or repeated between chunks. There are also situations where, even with limit and offset, the database ends up doing a full scan of the tables involved in the query. If that happens you will face a performance problem instead of a memory one.
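The chunking loop itself can be separated from the query. Below is a minimal sketch in plain Java, assuming a hypothetical fetchPage function that stands in for the QueryDSL call with offset and limit; the names ChunkedExport and forEachChunk are made up for illustration and the "table" is just an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ChunkedExport {

    /**
     * Calls fetchPage(offset, limit) repeatedly and hands each non-empty chunk to the sink.
     * Returns the total number of rows processed.
     */
    public static <T> long forEachChunk(BiFunction<Long, Long, List<T>> fetchPage,
                                        long limit,
                                        Consumer<List<T>> sink) {
        long offset = 0;
        long total = 0;
        while (true) {
            List<T> chunk = fetchPage.apply(offset, limit);
            if (chunk.isEmpty()) {
                break; // nothing left to read
            }
            sink.accept(chunk);
            total += chunk.size();
            if (chunk.size() < limit) {
                break; // short page: this was the last chunk
            }
            offset += limit;
        }
        return total;
    }

    public static void main(String[] args) {
        // Stand-in for the database: 250 "rows" in a list (assumption for the demo).
        List<Integer> table = IntStream.range(0, 250).boxed().collect(Collectors.toList());
        List<Integer> chunkSizes = new ArrayList<>();

        long total = ChunkedExport.<Integer>forEachChunk(
                (offset, limit) -> table.stream().skip(offset).limit(limit).collect(Collectors.toList()),
                100,
                chunk -> chunkSizes.add(chunk.size())); // a real sink would write CSV lines

        System.out.println(total + " rows in chunks " + chunkSizes);
    }
}
```

In a real export the sink would write the chunk to the CSV writer, and with JPA it is usually worth clearing the persistence context between chunks so that already exported entities can be garbage collected.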

Use JPAQueryFactory:
// com.querydsl.jpa.impl.JPAQueryFactory
JPAQueryFactory jpaQueryFactory = new JPAQueryFactory(entityManager);
Expression<MyEntity> select = QMyEntity.myEntity;
EntityPath<MyEntity> entityPath = QMyEntity.myEntity;
// createQuery() returns the underlying JPA Query; getResultStream() requires JPA 2.2 or later
try (Stream<?> stream = jpaQueryFactory
        .select(select)
        .from(entityPath)
        .where(cond)
        .createQuery()
        .getResultStream()) {
    // do something with each element
} // the stream is closed here, which releases the underlying cursor

Related

How do I update one column of all rows in a large table in my Spring Boot application?

I have a Spring Boot 2.x project with a big table in my Cassandra database. In my Liquibase migration class, I need to replace the value of one column in all rows.
For me it is a big performance hit when I try to solve this with
SELECT * FROM BOOKING
forEach Row
Update Row
because of the total number of rows, even when I select only one column.
Is it possible to do something like a "partwise/pagination" loop?
Pseudocode:
Take first 1000 rows
do Update
Take next 1000 rows
do Update
loop.
I'm also happy about any other solution approaches you have.

Things you must know:
Make sure there is a way to group the updates by partition. If you try a batch update on 1000 rows that are not in the same partition, the coordinator of the request will suffer: you are moving the load from your client to the coordinator, whereas you want to parallelize the writes instead. A batch update in Cassandra has nothing to do with the one in relational databases.
For fine-grained operations like this you want to go back to using the drivers, with CassandraOperations and CqlSession, for maximum control.
There is a way to paginate with Spring Data Cassandra using Slice, but you do not have control over how the operations are implemented.
Spring Data Cassandra core:
// myEntityRepo is a CassandraRepository; size, page and currpage are defined by the caller
Slice<MyEntity> slice = myEntityRepo.findAll(CassandraPageRequest.first(size));
while (slice.hasNext() && currpage < page) {
    slice = myEntityRepo.findAll(slice.nextPageable());
    currpage++;
}
slice.getContent();
Drivers:
// Prepare statements to speed up queries
PreparedStatement selectPS = session.prepare(QueryBuilder
    .selectFrom("myEntity").all()
    .build()
    .setPageSize(1000) // 1000 rows per page
    .setTimeout(Duration.ofSeconds(10))); // 10s timeout
PreparedStatement updatePS = session.prepare(QueryBuilder
    .update("mytable")
    .setColumn("myColumn", QueryBuilder.bindMarker())
    .whereColumn("myPK").isEqualTo(QueryBuilder.bindMarker())
    .build()
    .setConsistencyLevel(ConsistencyLevel.ONE)); // fast writes
// Paginate
ResultSet page1 = session.execute(selectPS.bind());
Iterator<Row> page1Iter = page1.iterator();
// Consume only the rows of the current page, without triggering the next fetch
while (0 < page1.getAvailableWithoutFetching()) {
    Row row = page1Iter.next();
    session.executeAsync(updatePS.bind(...));
}
// The paging state is set on the bound statement, not on the PreparedStatement
ByteBuffer pagingStateAsBytes =
    page1.getExecutionInfo().getPagingState();
ResultSet page2 = session.execute(selectPS.bind().setPagingState(pagingStateAsBytes));
You could of course wrap this pagination in a loop and track progress.
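To illustrate the first point (grouping the updates by partition), here is a minimal sketch in plain Java without any driver calls. The BookingRow class and its fields are hypothetical stand-ins for the rows read from a page; the idea is only that each resulting group could then be written on its own, for example as one small batch per partition or as parallel async writes.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PartitionGrouping {

    /** Hypothetical row: the partition key plus a clustering key identifying the row. */
    public static final class BookingRow {
        private final String partitionKey;
        private final int clusteringKey;

        public BookingRow(String partitionKey, int clusteringKey) {
            this.partitionKey = partitionKey;
            this.clusteringKey = clusteringKey;
        }

        public String getPartitionKey() { return partitionKey; }
        public int getClusteringKey() { return clusteringKey; }
    }

    /** Groups rows by partition key, keeping the order in which partitions were first seen. */
    public static Map<String, List<BookingRow>> groupByPartition(List<BookingRow> rows) {
        return rows.stream().collect(
                Collectors.groupingBy(BookingRow::getPartitionKey, LinkedHashMap::new, Collectors.toList()));
    }

    public static void main(String[] args) {
        List<BookingRow> page = Arrays.asList(
                new BookingRow("hotel-1", 1),
                new BookingRow("hotel-2", 1),
                new BookingRow("hotel-1", 2));

        Map<String, List<BookingRow>> groups = groupByPartition(page);
        // Each entry is one partition; its rows can be updated together.
        groups.forEach((key, rows) -> System.out.println(key + " -> " + rows.size() + " row(s)"));
    }
}
```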

Pagination with duplicate records

I need to apply pagination in a Spring Boot project.
I apply pagination in two queries, each of which returns data from a different table. Some of these records are identical in the two tables and hence need to be removed.
In the end, the number of entries that I need to send is reduced, which ruins the pagination applied initially. How do I go about this? What should my approach be?
Here I take two lists from two JPA calls (highRiskCust and amlPositiveCust) that apply pagination, then remove the duplicates and return the final result (tempReport).
List<L1ComplianceResponseDTO> highRiskCust = customerKyc.findAllHighRiskL1Returned(startDateTime, endDateTime, agentIds);
List<L1ComplianceResponseDTO> amlPositiveCust = customerKyc.findAllAmlPositiveL1Returned(startDateTime, endDateTime, agentIds);
List<L1ComplianceResponseDTO> tempReport = new ArrayList<>();
tempReport.addAll(amlPositiveCust);
tempReport.addAll(highRiskCust);
tempReport = tempReport.stream().filter(distinctByKey(p -> p.getKycTicketId()))
    .collect(Collectors.toList());
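The distinctByKey helper used in the snippet is not shown in the question. A common way to write such a helper is sketched below; this is an assumption about what it looks like, not the asker's actual code, and the sample strings are made up. It also makes the problem visible: two "pages" of two entries each shrink to three entries after deduplication.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class DistinctByKeyExample {

    /**
     * Returns a stateful predicate that accepts an element only the first time its key is seen.
     * Keys must not be null, because the backing set is a ConcurrentHashMap key set.
     */
    public static <T> Predicate<T> distinctByKey(Function<? super T, ?> keyExtractor) {
        Set<Object> seen = ConcurrentHashMap.newKeySet();
        return element -> seen.add(keyExtractor.apply(element));
    }

    public static void main(String[] args) {
        // "ticketId:source" strings stand in for the DTOs of the two paginated queries.
        List<String> merged = Arrays.asList("T1:aml", "T2:aml", "T1:highRisk", "T3:highRisk");

        List<String> unique = merged.stream()
                .filter(distinctByKey(entry -> entry.split(":")[0]))
                .collect(Collectors.toList());

        System.out.println(unique); // prints [T1:aml, T2:aml, T3:highRisk]
    }
}
```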

For pagination to work, you need to do it in a single query.
Fundamentally, this query should use UNION.
Since JPA does not support UNION in JPQL, either you use a native query or you change your query logic to use outer joins.

Getting max value on server (Entity Framework)

I'm using EF Core but I'm not really an expert with it, especially when it comes to details like querying tables in a performant manner...
So what I try to do is simply get the max-value of one column from a table with filtered data.
What I have so far is this:
protected override void ReadExistingDBEntry()
{
using Model.ResultContext db = new();
// Filter Tabledata to the Rows relevant to us. the whole Table may contain 0 rows or millions of them
IQueryable<Measurement> dbMeasuringsExisting = db.Measurements
.Where(meas => meas.MeasuringInstanceGuid == Globals.MeasProgInstance.Guid
&& meas.MachineId == DBMatchingItem.Id);
if (dbMeasuringsExisting.Any())
{
// the max value we're interested in. Still dbMeasuringsExisting could contain millions of rows
iMaxMessID = dbMeasuringsExisting.Max(meas => meas.MessID);
}
}
The equivalent SQL to what I want would be something like this.
select max(MessID)
from Measurement
where MeasuringInstanceGuid = Globals.MeasProgInstance.Guid
and MachineId = DBMatchingItem.Id;
While the above code works (it returns the correct value), I think it has a performance issue when the database table is getting larger, because the max filtering is done at the client-side after all rows are transferred, or am I wrong here?
How to do it better? I want the database server to filter my data. Of course I don't want any SQL script ;-)

The Max call on an IQueryable is translated to SQL and executed on the database server, so the rows are not transferred to the client. What remains is that Max raises an error when the filtered set is empty. This can be addressed by casting the selected value to a nullable type, so that an empty set yields null instead of an error, and then applying a default value for the int. Alternatively, you can just assign the result to a nullable int. Note the assumption here that the ID is an integer. The same principle would apply to a Guid as well.
int MaxMessID = dbMeasuringsExisting.Max(p => (int?)p.MessID) ?? 0;
There is no need for the Any() call, as it causes an additional round trip to the database, which is not desirable in this case.

Spring + Hibernate: Query Plan Cache Memory usage

I'm programming an application with the latest version of Spring Boot. I recently started having problems with a growing heap that cannot be garbage collected. The analysis of the heap with Eclipse MAT showed that, within one hour of running the application, the heap grew to 630MB, with Hibernate's SessionFactoryImpl using more than 75% of the whole heap.
I was looking for possible sources around the Query Plan Cache, but the only thing I found was this, and it did not work out. The properties were set like this:
spring.jpa.properties.hibernate.query.plan_cache_max_soft_references=1024
spring.jpa.properties.hibernate.query.plan_cache_max_strong_references=64
The database queries are all generated by the Spring's Query magic, using repository interfaces like in this documentation. There are about 20 different queries generated with this technique. No other native SQL or HQL are used.
Sample:
@Transactional
public interface TrendingTopicRepository extends JpaRepository<TrendingTopic, Integer> {
    List<TrendingTopic> findByNameAndSource(String name, String source);
    List<TrendingTopic> findByDateBetween(Date dateStart, Date dateEnd);
    Long countByDateBetweenAndName(Date dateStart, Date dateEnd, String name);
}
or
List<SomeObject> findByNameAndUrlIn(String name, Collection<String> urls);
as example for IN usage.
Question is: Why does the query plan cache keep growing (it does not stop, it ends in a full heap) and how to prevent this? Did anyone encounter a similar problem?
Versions:
Spring Boot 1.2.5
Hibernate 4.3.10

I've hit this issue as well. It basically boils down to having variable number of values in your IN clause and Hibernate trying to cache those query plans.
There are two great blog posts on this topic.
The first:
Using Hibernate 4.2 and MySQL in a project with an in-clause query
such as: select t from Thing t where t.id in (?)
Hibernate caches these parsed HQL queries. Specifically the Hibernate
SessionFactoryImpl has QueryPlanCache with queryPlanCache and
parameterMetadataCache. But this proved to be a problem when the
number of parameters for the in-clause is large and varies.
These caches grow for every distinct query. So this query with 6000
parameters is not the same as 6001.
The in-clause query is expanded to the number of parameters in the
collection. Metadata is included in the query plan for each parameter
in the query, including a generated name like x10_, x11_ , etc.
Imagine 4000 different variations in the number of in-clause parameter
counts, each of these with an average of 4000 parameters. The query
metadata for each parameter quickly adds up in memory, filling up the
heap, since it can't be garbage collected.
This continues until all different variations in the query parameter
count is cached or the JVM runs out of heap memory and starts throwing
java.lang.OutOfMemoryError: Java heap space.
Avoiding in-clauses is an option, as well as using a fixed collection
size for the parameter (or at least a smaller size).
For configuring the query plan cache max size, see the property
hibernate.query.plan_cache_max_size, defaulting to 2048 (easily too
large for queries with many parameters).
And second (also referenced from the first):
Hibernate internally uses a cache that maps HQL statements (as
strings) to query plans. The cache consists of a bounded map limited
by default to 2048 elements (configurable). All HQL queries are loaded
through this cache. In case of a miss, the entry is automatically
added to the cache. This makes it very susceptible to thrashing - a
scenario in which we constantly put new entries into the cache without
ever reusing them and thus preventing the cache from bringing any
performance gains (it even adds some cache management overhead). To
make things worse, it is hard to detect this situation by chance - you
have to explicitly profile the cache in order to notice that you have
a problem there. I will say a few words on how this could be done
later on.
So the cache thrashing results from new queries being generated at
high rates. This can be caused by a multitude of issues. The two most
common that I have seen are - bugs in hibernate which cause parameters
to be rendered in the JPQL statement instead of being passed as
parameters and the use of an "in" - clause.
Due to some obscure bugs in hibernate, there are situations when
parameters are not handled correctly and are rendered into the JPQL
query (as an example check out HHH-6280). If you have a query that is
affected by such defects and it is executed at high rates, it will
thrash your query plan cache because each JPQL query generated is
almost unique (containing IDs of your entities for example).
The second issue lays in the way that hibernate processes queries with
an "in" clause (e.g. give me all person entities whose company id
field is one of 1, 2, 10, 18). For each distinct number of parameters
in the "in"-clause, hibernate will produce a different query - e.g.
select x from Person x where x.company.id in (:id0_) for 1 parameter,
select x from Person x where x.company.id in (:id0_, :id1_) for 2
parameters and so on. All these queries are considered different, as
far as the query plan cache is concerned, resulting again in cache
thrashing. You could probably work around this issue by writing a
utility class to produce only certain number of parameters - e.g. 1,
10, 100, 200, 500, 1000. If you, for example, pass 22 parameters, it
will return a list of 100 elements with the 22 parameters included in
it and the remaining 78 parameters set to an impossible value (e.g. -1
for IDs used for foreign keys). I agree that this is an ugly hack but
could get the job done. As a result you will only have at most 6
unique queries in your cache and thus reduce thrashing.
So how do you find out that you have the issue? You could write some
additional code and expose metrics with the number of entries in the
cache e.g. over JMX, tune logging and analyze the logs, etc. If you do
not want to (or can not) modify the application, you could just dump
the heap and run this OQL query against it (e.g. using mat): SELECT l.query.toString() FROM INSTANCEOF org.hibernate.engine.query.spi.QueryPlanCache$HQLQueryPlanKey l. It
will output all queries currently located in any query plan cache on
your heap. It should be pretty easy to spot whether you are affected
by any of the aforementioned problems.
As far as the performance impact goes, it is hard to say as it depends
on too many factors. I have seen a very trivial query causing 10-20 ms
of overhead spent in creating a new HQL query plan. In general, if
there is a cache somewhere, there must be a good reason for that - a
miss is probably expensive so your should try to avoid misses as much
as possible. Last but not least, your database will have to handle
large amounts of unique SQL statements too - causing it to parse them
and maybe create different execution plans for every one of them.
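The bucket workaround described in the second quote (pad the IN list up to one of a few fixed sizes) could look roughly like the sketch below. The bucket sizes and the idea of an "impossible" filler value come from the quoted post; the class and method names are made up, and the sketch assumes that -1 can never be a real id in your data.

```java
import java.util.ArrayList;
import java.util.List;

public class InClauseBuckets {

    // Allowed IN-clause sizes, as listed in the quoted post.
    private static final int[] BUCKETS = {1, 10, 100, 200, 500, 1000};

    /**
     * Returns a copy of ids padded with impossibleValue up to the smallest bucket that fits.
     * An empty list is padded to the smallest bucket, so the query never gets an empty IN list.
     */
    public static List<Long> padToBucket(List<Long> ids, long impossibleValue) {
        for (int bucket : BUCKETS) {
            if (ids.size() <= bucket) {
                List<Long> padded = new ArrayList<>(ids);
                while (padded.size() < bucket) {
                    padded.add(impossibleValue);
                }
                return padded;
            }
        }
        throw new IllegalArgumentException(
                "More than " + BUCKETS[BUCKETS.length - 1] + " ids: split the list into several queries first");
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long id = 1; id <= 22; id++) {
            ids.add(id);
        }

        List<Long> padded = padToBucket(ids, -1L);
        long fillers = padded.stream().filter(id -> id == -1L).count();

        System.out.println(padded.size() + " parameters, " + fillers + " of them padding");
        // prints: 100 parameters, 78 of them padding
    }
}
```

The padded list is then passed as the IN parameter, so the query plan cache only ever sees six different shapes of the query.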

I have the same problem with many (>10000) parameters in IN queries. The number of parameters is always different and I cannot predict it, so my QueryPlanCache grows too fast.
For database systems supporting execution plan caching, there's a better chance of hitting the cache if the number of possible IN clause parameter counts is lowered.
Fortunately, Hibernate 5.2.18 and higher has a solution: padding of the parameters in the IN clause.
Hibernate can expand the bind parameters to a power of two: 4, 8, 16, 32, 64.
This way, an IN clause with 5, 6, or 7 bind parameters will use the 8-parameter IN clause, therefore reusing its execution plan.
If you want to activate this feature, set the property hibernate.query.in_clause_parameter_padding=true.
For more information see this article, atlassian.
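To get a feeling for what the power-of-two padding buys, here is a small sketch that computes the padded size for a given parameter count. It only illustrates the arithmetic and is not Hibernate's actual implementation; the class and method names are made up.

```java
import java.util.Set;
import java.util.TreeSet;

public class PowerOfTwoPadding {

    /** Smallest power of two that is greater than or equal to n (valid for 1 <= n <= 2^30). */
    public static int paddedSize(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        int highest = Integer.highestOneBit(n);
        return highest == n ? n : highest << 1;
    }

    public static void main(String[] args) {
        for (int n = 5; n <= 9; n++) {
            System.out.println(n + " parameters -> IN clause with " + paddedSize(n));
        }

        // How many different IN-clause sizes remain for 1..1000 parameters?
        Set<Integer> shapes = new TreeSet<>();
        for (int n = 1; n <= 1000; n++) {
            shapes.add(paddedSize(n));
        }
        System.out.println(shapes.size() + " distinct sizes: " + shapes);
    }
}
```

With padding, parameter counts from 1 to 1000 collapse to 11 different IN-clause sizes instead of 1000, which is what keeps the plan cache small.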

I had the exact same problem using Spring Boot 1.5.7 with Spring Data (Hibernate) and the following config solved the problem (memory leak):
spring:
  jpa:
    properties:
      hibernate:
        query:
          plan_cache_max_size: 64
          plan_parameter_metadata_max_size: 32

Starting with Hibernate 5.2.12, you can specify a hibernate configuration property to change how literals are to be bound to the underlying JDBC prepared statements by using the following:
hibernate.criteria.literal_handling_mode=BIND
From the Java documentation, this configuration property has 3 settings
AUTO (default)
BIND - Increases the likelihood of jdbc statement caching using bind parameters.
INLINE - Inlines the values rather than using parameters (be careful of SQL injection).

I had a similar issue. The cause is that you are building the query yourself instead of using a PreparedStatement. As a result, an execution plan is created and cached for each query with different parameters.
If you use a prepared statement, you should see a major improvement in memory usage.

TL;DR: Try to replace the IN() queries with ANY() or eliminate them
Explanation:
If a query contains IN(...) then a plan is created for each amount of values inside IN(...), since the query is different each time.
So if you have IN('a','b','c') and IN ('a','b','c','d','e') - those are two different query strings/plans to cache. This answer tells more about it.
In case of ANY(...) a single (array) parameter can be passed, so the query string will remain the same and the prepared statement plan will be cached once (example given below).
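The difference is easy to see by generating the query strings for a few collection sizes. The sketch below only builds strings and touches no database; the table and column names are borrowed from the question, and the class and method names are made up.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class InVersusAny {

    /** Query text with one placeholder per value, as generated for an IN clause. */
    public static String inQuery(int valueCount) {
        String placeholders = String.join(", ", Collections.nCopies(valueCount, "?"));
        return "SELECT * FROM trending_topic t WHERE t.url IN (" + placeholders + ")";
    }

    /** Query text with a single array placeholder; it does not depend on the value count. */
    public static String anyQuery(int valueCount) {
        return "SELECT * FROM trending_topic t WHERE t.url = ANY(?)";
    }

    public static void main(String[] args) {
        Set<String> inShapes = new HashSet<>();
        Set<String> anyShapes = new HashSet<>();
        for (int n = 1; n <= 5; n++) {
            inShapes.add(inQuery(n));
            anyShapes.add(anyQuery(n));
        }
        System.out.println("IN:  " + inShapes.size() + " distinct query strings");  // 5
        System.out.println("ANY: " + anyShapes.size() + " distinct query string");  // 1
    }
}
```

Every distinct string is a separate entry as far as a plan cache keyed by query text is concerned, which is why the IN variant grows with the number of different collection sizes.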
Cause:
This line might cause the issue:
List<SomeObject> findByNameAndUrlIn(String name, Collection<String> urls);
as under the hood it generates different IN() queries for every amount of values in "urls" collection.
Warning:
You may have IN() query without writing it and even without knowing about it.
ORM's such as Hibernate may generate them in the background - sometimes in unexpected places and sometimes in a non-optimal ways.
So consider enabling query logs to see the actual queries you have.
Fix:
Here is some (pseudo)code that may fix the issue:
String queryTemplate = "SELECT * FROM trending_topic t WHERE t.name = ? AND t.url = ANY(?)";
PreparedStatement preparedStatement = connection.prepareStatement(queryTemplate);
preparedStatement.setString(1, name); // safely bind the first query parameter to name
preparedStatement.setArray(2, connection.createArrayOf("text", urls.toArray())); // bind the 2nd parameter to an array of texts, so the condition behaves like "= ANY(ARRAY['aaa','bbb'])"
But:
Don't take any solution as a ready-to-use answer. Make sure to test the final performance on actual/big data before going to production - no matter which answer you choose.
Why? Because IN and ANY both have pros and cons, and they can bring serious performance issues if used improperly (see examples in references below). Also make sure to use parameter binding to avoid security issues as well.
References:
100x faster Postgres performance by changing 1 line - performance of Any(ARRAY[]) vs ANY(VALUES())
Index not used with =any() but used with in - different performance of IN and ANY
Understanding SQL Server query plan cache
Hope this helps. Make sure to leave a feedback whether it worked or not - in order to help people like you. Thanks!

I had a big issue with this queryPlanCache, so I wrote a Hibernate cache monitor to see the queries in the queryPlanCache.
I run it in our QA environment as a Spring task every 5 minutes.
With it I found which IN queries I had to change to solve my cache problem.
One detail: I am using Hibernate 4.2.18 and I don't know whether it will work with other versions.
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import org.hibernate.ejb.HibernateEntityManagerFactory;
import org.hibernate.internal.SessionFactoryImpl;
import org.hibernate.internal.util.collections.BoundedConcurrentHashMap;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.dao.GenericDAO;
public class CacheMonitor {
private final Logger logger = LoggerFactory.getLogger(getClass());
@PersistenceContext(unitName = "MyPU")
private void setEntityManager(EntityManager entityManager) {
HibernateEntityManagerFactory hemf = (HibernateEntityManagerFactory) entityManager.getEntityManagerFactory();
sessionFactory = (SessionFactoryImpl) hemf.getSessionFactory();
fillQueryMaps();
}
private SessionFactoryImpl sessionFactory;
private BoundedConcurrentHashMap queryPlanCache;
private BoundedConcurrentHashMap parameterMetadataCache;
/*
* I tried to use a MAP and use compare compareToIgnoreCase.
* But remember this is causing memory leak. Doing this
* you will explode the memory faster that it already was.
*/
public void log() {
if (!logger.isDebugEnabled()) {
return;
}
if (queryPlanCache != null) {
long cacheSize = queryPlanCache.size();
logger.debug(String.format("QueryPlanCache size is :%s ", Long.toString(cacheSize)));
for (Object key : queryPlanCache.keySet()) {
int filterKeysSize = 0;
// QueryPlanCache.HQLQueryPlanKey (Inner Class)
Object queryValue = getValueByField(key, "query", false);
if (queryValue == null) {
// NativeSQLQuerySpecification
queryValue = getValueByField(key, "queryString");
filterKeysSize = ((Set) getValueByField(key, "querySpaces")).size();
if (queryValue != null) {
writeLog(queryValue, filterKeysSize, false);
}
} else {
filterKeysSize = ((Set) getValueByField(key, "filterKeys")).size();
writeLog(queryValue, filterKeysSize, true);
}
}
}
if (parameterMetadataCache != null) {
long cacheSize = parameterMetadataCache.size();
logger.debug(String.format("ParameterMetadataCache size is :%s ", Long.toString(cacheSize)));
for (Object key : parameterMetadataCache.keySet()) {
logger.debug("Query:{}", key);
}
}
}
private void writeLog(Object query, Integer size, boolean b) {
if (query == null || query.toString().trim().isEmpty()) {
return;
}
StringBuilder builder = new StringBuilder();
builder.append(b == true ? "JPQL " : "NATIVE ");
builder.append("filterKeysSize").append(":").append(size);
builder.append("\n").append(query).append("\n");
logger.debug(builder.toString());
}
private void fillQueryMaps() {
Field queryPlanCacheSessionField = null;
Field queryPlanCacheField = null;
Field parameterMetadataCacheField = null;
try {
queryPlanCacheSessionField = searchField(sessionFactory.getClass(), "queryPlanCache");
queryPlanCacheSessionField.setAccessible(true);
queryPlanCacheField = searchField(queryPlanCacheSessionField.get(sessionFactory).getClass(), "queryPlanCache");
queryPlanCacheField.setAccessible(true);
parameterMetadataCacheField = searchField(queryPlanCacheSessionField.get(sessionFactory).getClass(), "parameterMetadataCache");
parameterMetadataCacheField.setAccessible(true);
queryPlanCache = (BoundedConcurrentHashMap) queryPlanCacheField.get(queryPlanCacheSessionField.get(sessionFactory));
parameterMetadataCache = (BoundedConcurrentHashMap) parameterMetadataCacheField.get(queryPlanCacheSessionField.get(sessionFactory));
} catch (Exception e) {
logger.error("Failed fillQueryMaps", e);
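} finally {
// A field is still null when its lookup above failed; guard to avoid a NullPointerException here
if (queryPlanCacheSessionField != null) queryPlanCacheSessionField.setAccessible(false);
if (queryPlanCacheField != null) queryPlanCacheField.setAccessible(false);
if (parameterMetadataCacheField != null) parameterMetadataCacheField.setAccessible(false);
}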
}
private <T> T getValueByField(Object toBeSearched, String fieldName) {
return getValueByField(toBeSearched, fieldName, true);
}
@SuppressWarnings("unchecked")
private <T> T getValueByField(Object toBeSearched, String fieldName, boolean logErro) {
Boolean accessible = null;
Field f = null;
try {
f = searchField(toBeSearched.getClass(), fieldName, logErro);
accessible = f.isAccessible();
f.setAccessible(true);
return (T) f.get(toBeSearched);
} catch (Exception e) {
if (logErro) {
logger.error("Field: {} error trying to get for: {}", fieldName, toBeSearched.getClass().getName());
}
return null;
} finally {
if (accessible != null) {
f.setAccessible(accessible);
}
}
}
private Field searchField(Class<?> type, String fieldName) {
return searchField(type, fieldName, true);
}
private Field searchField(Class<?> type, String fieldName, boolean log) {
List<Field> fields = new ArrayList<Field>();
for (Class<?> c = type; c != null; c = c.getSuperclass()) {
fields.addAll(Arrays.asList(c.getDeclaredFields()));
for (Field f : c.getDeclaredFields()) {
if (fieldName.equals(f.getName())) {
return f;
}
}
}
if (log) {
logger.warn("Field: {} not found for type: {}", fieldName, type.getName());
}
return null;
}
}

We also had a QueryPlanCache with growing heap usage. We had IN-queries which we rewrote, and additionally we have queries which use custom types. Turned out that the Hibernate class CustomType didn't properly implement equals and hashCode thereby creating a new key for every query instance. This is now solved in Hibernate 5.3.
See https://hibernate.atlassian.net/browse/HHH-12463.
You still need to properly implement equals/hashCode in your userTypes to make it work properly.

We faced this issue too: the query plan cache was growing too fast, and the old-gen heap was growing along with it because the GC was unable to collect it. The culprit was a JPA query taking more than 200000 ids in the IN clause. To optimise the query we used joins instead of fetching the ids from one table and passing them into the select query on the other table.

Hibernate Delete object by id

Which is the best method (performance-wise) to delete an object if only its id is available?
HQL. Will executing this HQL load the SessionContext object into the Hibernate persistence context?
for(int i=0; i<listOfIds.size(); i++){
Query q = createQuery("delete from session_context where id = :id ");
q.setLong("id", id);
q.executeUpdate();
}
Load by ID and delete.
for(int i=0; i<listOfIds.size(); i++){
SessionContext session_context = (SessionContext)getHibernateTemplate().load(SessionContext.class, listOfIds.get(i));
getHibernateTemplate().delete(session_context) ;
}
Here SessionContext is the object mapped to session_context table.
Or is there, of course, an altogether different and better approach?

Of the two, the first one is better, because it saves memory: the HQL delete does not load the entities into the session. When you want to delete an entity and you already have its id, writing an HQL statement is preferred.
In your case there is a third and better option.
Try the below:
// construct the list of ids with a string buffer, by iterating the list
String idList = "1,2,3,.....";
Query q = createQuery("delete from session_context where id in (:idList) ");
q.setString("idList", idList);
q.executeUpdate();
Now if there are 4 items in the list, only one query will be fired; previously there would be 4.
Note: for the above to work, session_context should be an independent table.

Btw, say no to that ugly string, there is .setParameterList(), so:
List<Long> idList = Arrays.asList(1L, 2L, 3L);
Query q = createQuery("delete from session_context where id in (:idList) ");
q.setParameterList("idList", idList);
q.executeUpdate();
Update:
I must update this answer: in our environment it turned out, in the end, that using setParameterList gives much worse performance than creating the string manually and using setString as @ManuPK suggested.

You should also consider caching - first level (session) and second level cache.
The first option is probably the best if the delete is the only or the first operation in transaction.
If you query for some SessionContext objects and then call the HQL delete, all objects in the query cache will be evicted, because Hibernate doesn't know which ones were deleted. This is not the case with the second approach.
If you use second level cache then it is even more complicated and highly depends on what you do with SessionContext objects.
