Spring Data JPA garbage collection - spring-boot

I have a Spring Batch application with a JpaPagingItemReader (modified slightly) and four JPA repositories that enrich the Model objects coming out of the reader.
My flow is:
Select Models (page size = 8192), collect the List<Model> into a Map<String, List<Model>> (grouped by id, because the models are not unique and I need to enrich them by id), enrich them with four custom JpaRepositories using native queries with IN clauses, and merge the results with Java 8 Streams.
Convert the data to XML objects and write them with StAX through a MultiFileItemWriter, splitting the output at no more than 20,000 records per file.
Everything worked well until today, when I ran the flow against a large amount of data from the database. I generated 20 files (2.2 GB), but sometimes I got OutOfMemoryError: Java heap space (I had -Xms and -Xmx at 1 GB). After raising them to 2 GB everything works, but in Instana I see that old-gen memory stays around 900 MB in use after GC, with about 1.3-1.7 GB in use overall. So I started thinking about how to get the Spring Data JPA objects garbage-collected sooner; they seem to stay in memory for a long time.
When I select Models with the JpaPagingItemReader I detach every Model (with entityManager.detach), but when I enrich the Models with the custom Spring Data JPA queries I do not detach the results. Maybe that is the problem and I should detach them too?
I do not need to insert data into the database, only read it. Or should I reduce the page size and select about 4,000 records per request?
I need to process 370,000 records from the database and enrich them.
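For what it's worth, a minimal sketch of detaching the enrichment results as well, assuming the enrichment runs through an injected EntityManager with a JPQL IN query (Model and the query are placeholders for the actual repository setup). Calling clear() once per page detaches everything in the persistence context in one go, which is cheaper than calling detach() per entity:

import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

public class EnrichmentQueries {

    @PersistenceContext
    private EntityManager entityManager;

    public List<Model> loadEnrichment(List<String> ids) {
        List<Model> rows = entityManager
                .createQuery("SELECT m FROM Model m WHERE m.id IN :ids", Model.class)
                .setParameter("ids", ids)
                .getResultList();
        // clear() detaches every entity currently tracked by the persistence
        // context, so the enrichment results from this page become eligible
        // for GC as soon as the merged output has been written.
        entityManager.clear();
        return rows;
    }
}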

Solved. I added flags to my run configuration and doubled -Xms and -Xmx.
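The answer doesn't list the exact flags; assuming the doubling described above, the launch configuration would look something like this (heap sizes are illustrative, and the jar name is a placeholder):

java -Xms2g -Xmx2g -jar spring-batch-app.jar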

Related

How to use spring data redis pipeline to process 1 million records?

I have a requirement where I have to read a file containing around 1 million records, one record per line. Each record is validated and then saved in the Redis cache. I implemented this in the traditional way, reading each line and saving it to Redis one at a time, but that hurts performance very badly.
Then I came to know about the Redis pipeline feature, which lets me process records in batches, say 10k at a time, to improve performance.
How can I use this feature in my current scenario? Any small, simple example is appreciated. I am using Redis 2.1.1.RELEASE and Spring Boot 2.0.8.RELEASE.
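A minimal sketch of that batching, assuming a StringRedisTemplate bean and the records already validated into parallel key/value lists (both names are placeholders for however the file is parsed). It uses executePipelined with the StringRedisConnection cast shown in the Spring Data Redis reference documentation:

import java.util.List;
import org.springframework.data.redis.connection.StringRedisConnection;
import org.springframework.data.redis.core.RedisCallback;
import org.springframework.data.redis.core.StringRedisTemplate;

public class PipelinedLoader {

    private static final int BATCH_SIZE = 10_000;

    private final StringRedisTemplate redisTemplate;

    public PipelinedLoader(StringRedisTemplate redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    /** keys.get(i) is stored under values.get(i); both lists come from the validated file. */
    public void load(List<String> keys, List<String> values) {
        for (int from = 0; from < keys.size(); from += BATCH_SIZE) {
            int start = from;
            int to = Math.min(from + BATCH_SIZE, keys.size());
            // executePipelined queues all commands on one connection and reads
            // the replies in bulk, instead of paying one round trip per record.
            redisTemplate.executePipelined((RedisCallback<Object>) connection -> {
                StringRedisConnection stringConn = (StringRedisConnection) connection;
                for (int i = start; i < to; i++) {
                    stringConn.set(keys.get(i), values.get(i));
                }
                return null; // pipelined callbacks must return null
            });
        }
    }
}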

Spring JPA taking too much time & memory for inserting data in postgres DB

I am working on a Spring Batch app (with 2 GB of memory) that processes data (using select queries while processing) and inserts about 1 million processed records into a Postgres DB. I am using Spring Data JPA for this project, but it consumes too much memory while processing these records, and eventually I get an OutOfMemoryError. I suspect that too many entities are created and never cleared. I tried clearing the entityManager after certain DB calls, but it didn't help. How can I reduce the memory consumed by JPA? Any suggestion to reduce memory consumption would be highly appreciated.
Possible reasons:
Number of HTTP threads (Undertow starts around 50 threads by default, but you can increase/decrease the number via a property)
Static variables
Use of caches (memcached, ehcache, etc.)
Cascade persist
Batch writing (see the sketch below)
Setting
server.tomcat.max-threads=5
in your application.properties will limit the number of HTTP request handler threads to 5 (the default is 50).
For details, refer to R1 R2 R3
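On the batch-writing point, a minimal sketch of what that usually looks like with plain JPA, using a placeholder ProcessedRecord entity. Note that flush() must come before clear(), otherwise clearing discards pending inserts, which may be why clearing alone didn't help here. Pairing this with spring.jpa.properties.hibernate.jdbc.batch_size=50 also batches the INSERT statements at the JDBC level:

import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import org.springframework.transaction.annotation.Transactional;

public class BatchInserter {

    private static final int BATCH_SIZE = 50;

    @PersistenceContext
    private EntityManager entityManager;

    @Transactional
    public void insertAll(List<ProcessedRecord> records) {
        for (int i = 0; i < records.size(); i++) {
            entityManager.persist(records.get(i));
            if ((i + 1) % BATCH_SIZE == 0) {
                entityManager.flush(); // push the current batch to the database
                entityManager.clear(); // drop managed entities so they can be GC'd
            }
        }
        entityManager.flush(); // write the final partial batch
        entityManager.clear();
    }
}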

Hibernate large collections performance problems despite using 2nd Level Cache

We have a parent object with a collection of 500,000 child objects. We are using Hibernate for mapping, with ehcache as the cache provider. Using the 2nd-level cache for entities and collections works fine, as we can avoid requests to the database.
But loading 500,000 objects through the 2nd-level cache still produces a lot of CPU load and memory garbage, and results in a response time of a few seconds. As the child objects are not immutable, we can't enable the hibernate.cache.use_reference_entries property.
With an application-layer cache of DAO objects on top of the Hibernate 2nd-level cache, there is no CPU or garbage-memory overhead, and the response time is a few milliseconds instead of seconds.
But the big disadvantage of this solution is that we have to manage this cache ourselves, including invalidation and synchronization in a clustered, multithreaded system.
My question is whether there's a better solution with the same low-CPU, low-garbage advantages. Does anyone have experience handling large collections?
Do you really need all 500k at once?
You could remove the collection from the Parent and query the children by parent: SELECT c FROM Child c WHERE c.parent = :parent, adding pagination or filtering when you don't need all 500k at once.
You could also load the Child entities as DTOs, which would improve memory behavior because Hibernate would not consider the DTOs for dirty checking. I'd guess this cuts the memory footprint roughly in half, although I never benchmarked it. A DTO also lets you omit attributes you don't need in this particular use case, saving memory and CPU. Both approaches are sketched below.
You could also take a look at enableDirtyTracking (bytecode enhancement) in Hibernate 5.
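A minimal sketch of both suggestions, assuming mapped Parent and Child entities and a hypothetical ChildDto class with a matching constructor (all three names are placeholders):

import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.TypedQuery;

public class ChildQueries {

    private final EntityManager entityManager;

    public ChildQueries(EntityManager entityManager) {
        this.entityManager = entityManager;
    }

    /** Pagination: fetch one slice of the children instead of the whole collection. */
    public List<Child> pageOfChildren(Parent parent, int page, int pageSize) {
        return entityManager
                .createQuery("SELECT c FROM Child c WHERE c.parent = :parent", Child.class)
                .setParameter("parent", parent)
                .setFirstResult(page * pageSize)
                .setMaxResults(pageSize)
                .getResultList();
    }

    /** DTO projection: only the listed columns are fetched, and the results
        are plain objects the persistence context never tracks. */
    public List<ChildDto> childDtos(Parent parent) {
        TypedQuery<ChildDto> query = entityManager.createQuery(
                "SELECT new com.example.ChildDto(c.id, c.name) FROM Child c WHERE c.parent = :parent",
                ChildDto.class);
        return query.setParameter("parent", parent).getResultList();
    }
}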

Paging SELECT query results from Cassandra in Spring Boot application

During my research I have come across this JIRA for Spring-Data-Cassandra:
https://jira.spring.io/browse/DATACASS-56
Now, according to the post above, SDC currently does not support pagination in a Spring app due to the structure of Cassandra. However, I'm thinking: if I can pull the entire row list into a Java List, can I paginate that List? I don't have much experience in Spring, but is there something I am missing when I assume this can be done?
Cassandra does not support pagination in the sense of pointing to a specific page (limit/offset) but generates a continuation token (PagingState) that is a set of bytes. Pulling a List of records will load all records in memory and possibly exhaust your memory (depending on the amount of data).
Spring Data Cassandra 1.5.0 RC1 comes with a streaming API in CassandraTemplate:
Iterator<Person> it = template.stream("SELECT * FROM person WHERE … ;", Person.class);

while (it.hasNext()) {
    Person person = it.next(); // advance the cursor; the driver fetches new bulks as needed
    // …
}
CassandraTemplate.stream(…) returns an Iterator that operates on an underlying ResultSet. The DataStax driver uses a configurable fetch size (5,000 rows by default) for bulk fetching. Streaming data access fetches as much or as little data as you need to process; data is retained neither by the driver nor by Spring Data Cassandra, and once a fetched bulk has been consumed from the Iterator, the underlying ResultSet fetches the next bulk itself.
The other alternative is using ResultSet directly, which gives you access to PagingState, and doing all the continuation/paging work yourself. You would, however, lose all the higher-level benefits of Spring Data Cassandra.
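A minimal sketch of that manual route with the DataStax 3.x driver: run one page, hand the serialized PagingState back to the caller, and resume from it on the next request. The table name reuses the person example above; the page size is an assumption:

import com.datastax.driver.core.PagingState;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class ManualPager {

    private final Session session;

    public ManualPager(Session session) {
        this.session = session;
    }

    /** Renders one page; returns the paging state for the next page, or null when done. */
    public String fetchPage(String pagingState, int pageSize) {
        Statement statement = new SimpleStatement("SELECT * FROM person").setFetchSize(pageSize);
        if (pagingState != null) {
            statement.setPagingState(PagingState.fromString(pagingState));
        }
        ResultSet rs = session.execute(statement);
        int remaining = rs.getAvailableWithoutFetching();
        for (Row row : rs) {
            // render the row …
            if (--remaining == 0) {
                break; // stop at the page boundary instead of auto-fetching the next bulk
            }
        }
        PagingState next = rs.getExecutionInfo().getPagingState();
        return next == null ? null : next.toString();
    }
}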

JSP Struts Performance/Memory Tips

I am working on a basic Struts application that is experiencing major memory spikes. Our monitoring tool shows a single request per user adding 3 MB to the JVM heap. Are there any tips to encourage earlier garbage collection, free up memory, or improve performance?
The application is a basic Struts application, but there are a lot of rows in the JSP report, so a lot of objects may be created. But it isn't anything you haven't seen before:
Perform a set of database queries.
Create a serializable POJO bean representing each row.
Add each row bean to an ArrayList.
Set the ArrayList on the form object when the action is invoked.
The JSP logic iterates through the list from the ActionForm and displays the data to the user.
Notes:
1. The form is in session scope, and possibly that ArrayList of data as well (maybe this is an issue).
2. The POJO bean contains 20 or so fields, a mix of String and BigDecimal data.
The report can have 300 to 1,200 or so rows, so at least that many objects are created.
Given the information you provide, I'd estimate that you're typically loading 1 to 2 megabytes of data per report: 750 rows * 20 fields * 100 bytes per field = 1.4 MB. Now consider all of the temporary objects needed between the database and the final markup; 3 MB isn't surprising.
I'd only be concerned if that memory seems to have leaked, i.e., if the next garbage collection of the young generation doesn't collect all of those objects.
When designing reports to be rendered in a web application, consider the number of records fetched from the database. If the number of records is high and the overall recordset takes up a lot of memory, consider paginating the report.
As far as possible, do not invoke the garbage collector explicitly, for two reasons:
Garbage collection is a costly process, as it scans the whole of memory.
Most production servers are tuned at the JVM level to ignore explicit garbage-collection requests.
I believe the problem is the ArrayList in the ActionForm, which needs to allocate a huge chunk of memory. I would write the query results directly to the response: read a row from the result set, write it to the response, read the next row, write it, and so on (a sketch follows below). Maybe it's not MVC, but it would be better for your heap :-)
ActionForms are fine for CRUD operations, but for reports ... I don't think so.
Note: if the ActionForm has scope=session, the instance (along with the huge ArrayList) stays alive until the session expires. If scope=request, the instance becomes eligible for GC as soon as the request completes.
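A minimal sketch of that row-by-row streaming, assuming a plain JDBC connection is available; the query and column names are placeholders. No ArrayList and no session-scoped form: each row becomes garbage as soon as it is written.

import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.servlet.http.HttpServletResponse;

public class ReportStreamer {

    public void stream(Connection connection, HttpServletResponse response) throws Exception {
        response.setContentType("text/html");
        PrintWriter out = response.getWriter();
        try (PreparedStatement ps = connection.prepareStatement("SELECT id, amount FROM report_rows");
             ResultSet rs = ps.executeQuery()) {
            out.println("<table>");
            while (rs.next()) {
                // One row in memory at a time; nothing is retained after the loop.
                out.println("<tr><td>" + rs.getString("id") + "</td><td>"
                        + rs.getBigDecimal("amount") + "</td></tr>");
            }
            out.println("</table>");
        }
    }
}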
