Benchmarking Spring Data vs JDBI for a select from a Postgres database

I wanted to compare the performance of Spring Data vs JDBI.
I used the following versions:
Spring Boot 2.2.4.RELEASE
vs
JDBI 3.13.0
The test is fairly simple: select * from the admin table and convert the rows to a list of Admin objects.
Here are the relevant details.
With Spring Boot:
public interface AdminService extends JpaRepository<Admin, Integer> {
}
And for JDBI:
public List<Admin> getAdmins() {
String sql = "Select admin_id as adminId, username from admins";
Handle handle = null;
try {
handle = Sql2oConnection.getInstance().getJdbi().open();
return handle.createQuery(sql).mapToBean(Admin.class).list();
}catch(Exception ex) {
log.error("Could not select admins from admins: {}", ex.getMessage(), ex );
return null;
} finally {
// open() may have thrown before handle was assigned
if (handle != null) {
handle.close();
}
}
}
The test class is executed using JUnit 5:
@Test
@DisplayName("How long does it take to run 1000 queries")
public void loadAdminTable() {
System.out.println("Running load test");
Instant start = Instant.now();
for(int i= 0;i<1000;i++) {
List<Admin> admins = adminService.getAdmins(); // for Spring it's findAll()
for(Admin admin: admins) {
if(admin.getAdminId() == 654) {
System.out.println("just to simulate work with the data");
}
}
}
Instant end = Instant.now();
Duration duration = Duration.between(start, end);
System.out.println("Total duration: " + duration.getSeconds());
}
I was quite shocked to get the following results:
Spring Data: 2 seconds
JDBI: 59 seconds
Any idea why I got these results? I was expecting JDBI to be faster.

The issue was that Spring manages the connection life cycle for us, and for a good reason.
After reading the JDBI docs, I found this:
"There is a performance penalty every time a connection is allocated
and released. In the example above, the two insertFullContact
operations take separate Connection objects from your database
connection pool."
I changed the JDBI test code to the following:
@Test
@DisplayName("How long does it take to run 1000 queries")
public void loadAdminTable() {
System.out.println("Running load test");
String sql = "Select admin_id as adminId, username from admins";
Handle handle = null;
handle = Sql2oConnection.getInstance().getJdbi().open();
Instant start = Instant.now();
for(int i= 0;i<1000;i++) {
List<Admin> admins = handle.createQuery(sql).mapToBean(Admin.class).list();
if(!admins.isEmpty()) {
for(Admin admin: admins) {
System.out.println(admin.getUsername());
}
}
}
handle.close();
Instant end = Instant.now();
Duration duration = Duration.between(start, end);
System.out.println("Total duration: " + duration.getSeconds());
}
This way the connection is opened once and the query runs 1000 times.
The final result was 1 second,
twice as fast as Spring.

You seem to be making some basic benchmarking mistakes:
You are not warming up the JVM.
You are not using the results in any way.
Therefore what you are seeing might just be the effect of different optimisations in the VM.
Look into JMH in order to improve your benchmarks.
Benchmarks with an external resource are extra hard, because you have so many more parameters to control.
One big question, for example, is whether the connection to the database is realistically slow. In most production systems the database will be on a different machine, at least virtually, and quite possibly on different hardware.
Is that true in your test as well?
Assuming your results are real, the next step is to investigate where the extra time is spent.
I would expect most of the time to be spent executing the SQL statements and obtaining the results over the network.
Therefore you should inspect which SQL statements actually get executed.
This might point you to one possible answer: that JPA is doing lots of lazy loading and hasn't even loaded most of what you really need.
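The two benchmarking points above (warm up the JVM, and actually use the results) can be illustrated with a small hand-rolled timer. This is only a sketch using the standard library, not JMH and not a replacement for it; the class name WarmedUpTimer and the fake query in main are made up for the illustration, and a real run would pass something like adminService::getAdmins instead.

```java
import java.time.Duration;
import java.util.List;
import java.util.function.Supplier;

final class WarmedUpTimer {
    // Accumulates a value derived from every result, so the measured call
    // has an observable effect and is not a candidate for dead-code removal.
    static long sink = 0;

    /** Runs the warm-up rounds unmeasured, then times only the measured rounds. */
    static Duration time(Supplier<? extends List<?>> query, int warmupRounds, int measuredRounds) {
        for (int i = 0; i < warmupRounds; i++) {
            sink += query.get().size();
        }
        long start = System.nanoTime(); // monotonic clock, suited to elapsed time
        for (int i = 0; i < measuredRounds; i++) {
            sink += query.get().size();
        }
        return Duration.ofNanos(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the real database call.
        Supplier<List<String>> fakeQuery = () -> List.of("admin1", "admin2");
        Duration elapsed = time(fakeQuery, 200, 1000);
        System.out.println("Measured rounds took " + elapsed.toMillis() + " ms (sink=" + sink + ")");
    }
}
```

Reporting milliseconds or nanoseconds rather than whole seconds also avoids the rounding that hides differences between a 1.0 s and a 1.9 s run.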

Related

Batch update operation using JDBC Template

I have a service where I have to update multiple rows. I was testing for a batch of 2000 rows. Using CrudRepository's saveAll() the update operation was taking 211 seconds.
After looking around for jdbc template I came across this implementation of it: https://mkyong.com/spring/spring-jdbctemplate-batchupdate-example/
My Implementation of it:
@Transactional
public int[][] batchUpdateBseStatus(List<ExchangeTradeStatus> users, int batchSize) {
int[][] updateCounts = jdbcTemplate.batchUpdate(
"update exchange_trade_status set bse_status = ? where id = ?",
users,
batchSize,
new ParameterizedPreparedStatementSetter<ExchangeTradeStatus>() {
public void setValues(PreparedStatement ps, ExchangeTradeStatus user)
throws SQLException {
ps.setString(1, user.getBseStatus().name());
ps.setInt(2, user.getId());
}
});
return updateCounts;
}
For the same update process it's now taking about 105 seconds. Reading more about implementing JDBC batch updates, I saw an implementation similar to mine whose author had published performance figures.
My time is pretty slow compared to those. Is there any fundamental flaw in my understanding and final implementation of the batchUpdate function, and how can I improve my time?
Update:
I used these two properties and it gave me an update time of 1.297 seconds for 1970 rows
spring.datasource.hikari.data-source-properties.useConfigs=maxPerformance
spring.datasource.hikari.data-source-properties.rewriteBatchedStatements=true

JCache Hazelcast embedded does not scale

Hello, Stackoverflow Community.
I have a Spring Boot application that uses Jcache with Hazelcast implementation as a cache Framework.
Each Hazelcast node has 5 caches with the size of 50000 elements each. There are 4 Hazelcast Instances that form a cluster.
The problem that I face is the following:
I have a very heavy call that reads data from all four caches. On the initial start, when all caches are yet empty, this call takes up to 600 seconds.
When there is one Hazelcast instance running and all 5 caches are filled with data, then this call happens relatively fast, it takes on average only 4 seconds.
When I start 2 Hazelcast instances and they form a cluster, then the response time gets worse, and the same call takes already 25 seconds on average.
And the more Hazelcast instances I add in a cluster, the longer the response time gets. Of course, I was expecting to see some worse delivery time when data is partitioned among Hazelcast nodes in a cluster. But I did not expect that just by adding one more hazelcast instance, the response time would get 6 - 7 times slower...
Please note that, for simplicity and for testing purposes, I start four Spring Boot instances on one machine, each with an embedded Hazelcast node. Therefore, such poor performance cannot be explained by network delays. I assume that this API call is so slow even with Hazelcast because a lot of data needs to be serialized/deserialized when sent between Hazelcast cluster nodes. Please correct me if I am wrong.
The cache data is partitioned evenly among all nodes. I was thinking about adding near cache in order to reduce latency, however, according to the Hazelcast Documentation, the near cache is not available for Jcache Members. In my case, because of some project requirements, I am not able to switch to Jcache Clients to make use of Near Cache. Is there maybe some advice on how to reduce latency in such a scenario?
Thank you in advance.
DUMMY CODE SAMPLES TO DEMONSTRATE THE PROBLEM:
Hazelcast Config: stays default, nothing is changed
Caches:
private void createCaches() {
CacheConfiguration<?, ?> cacheConfig = new CacheConfig<>()
.setEvictionConfig(
new EvictionConfig()
.setEvictionPolicy(EvictionPolicy.LRU)
.setSize(150000)
.setMaxSizePolicy(MaxSizePolicy.ENTRY_COUNT)
)
.setBackupCount(5)
.setInMemoryFormat(InMemoryFormat.OBJECT)
.setManagementEnabled(true)
.setStatisticsEnabled(true);
cacheManager.createCache("books", cacheConfig);
cacheManager.createCache("bottles", cacheConfig);
cacheManager.createCache("chairs", cacheConfig);
cacheManager.createCache("tables", cacheConfig);
cacheManager.createCache("windows", cacheConfig);
}
Dummy Controller:
@GetMapping("/dummy_call")
public String getExampleObjects() { // simulates a situation where one call needs to fetch data from multiple cached sources.
Instant start = Instant.now();
int i = 0;
while (i != 50000) {
exampleService.getBook(i);
exampleService.getBottle(i);
exampleService.getChair(i);
exampleService.getTable(i);
exampleService.getWindow(i);
i++;
}
Instant end = Instant.now();
return String.format("The heavy call took: %d seconds", Duration.between(start, end).getSeconds()); // %d prints decimal; %o would print octal
}
Dummy service:
@Service
public class ExampleService {
@CacheResult(cacheName = "books")
public Book getBook(int i) {
try {
Thread.sleep(1); // just to simulate slow service here!
} catch (InterruptedException e) {
e.printStackTrace();
}
return new Book(Integer.toString(i), Integer.toString(i));
}
@CacheResult(cacheName = "bottles")
public Bottle getBottle(int i) {
try {
Thread.sleep(1);
} catch (InterruptedException e) {
e.printStackTrace();
}
return new Bottle(Integer.toString(i), Integer.toString(i));
}
@CacheResult(cacheName = "chairs")
public Chair getChair(int i) {
try {
Thread.sleep(1);
} catch (InterruptedException e) {
e.printStackTrace();
}
return new Chair(Integer.toString(i), Integer.toString(i));
}
@CacheResult(cacheName = "tables")
public Table getTable(int i) {
try {
Thread.sleep(1);
} catch (InterruptedException e) {
e.printStackTrace();
}
return new Table(Integer.toString(i), Integer.toString(i));
}
@CacheResult(cacheName = "windows")
public Window getWindow(int i) {
try {
Thread.sleep(1);
} catch (InterruptedException e) {
e.printStackTrace();
}
return new Window(Integer.toString(i), Integer.toString(i));
}
}
If you do the math:
4s / 250 000 lookups is 0.016 ms per local lookup. This seems rather high, but let's take that.
When you add a single node then the data gets partitioned and half of the requests will be served from the other node. If you add 2 more nodes (4 total) then 25 % of the requests will be served locally and 75 % will be served over network. This should explain why the response time grows when you add more nodes.
Even a simple ping on localhost takes twice as long or more. On a real network the read latency we see in benchmarks is 0.3-0.4 ms per read call. This makes:
0.25 * 250k * 0.016 ms + 0.75 * 250k * 0.3 ms = ~57 s
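The same estimate can be written as a small function to see how the total grows with the number of nodes. This is a sketch of the simplified model used above (data evenly partitioned, every lookup served by the partition owner, 0.016 ms per local and 0.3 ms per remote lookup); the class and method names are made up for the illustration.

```java
final class LookupLatencyEstimate {
    /** Estimated total time in seconds for serial lookups split between local and remote nodes. */
    static double estimateSeconds(long lookups, int nodes, double localMs, double remoteMs) {
        double localShare = 1.0 / nodes;        // fraction of keys owned by the local node
        double remoteShare = 1.0 - localShare;  // everything else goes over the network
        double totalMs = lookups * (localShare * localMs + remoteShare * remoteMs);
        return totalMs / 1000.0;
    }

    public static void main(String[] args) {
        for (int nodes = 1; nodes <= 4; nodes++) {
            System.out.printf("%d node(s): ~%.2f s%n",
                    nodes, estimateSeconds(250_000, nodes, 0.016, 0.3));
        }
    }
}
```

With one node the model gives the measured 4 s, and with four nodes it gives roughly 57 s, because three quarters of the 250,000 lookups pay the remote latency.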
You simply won't be able to make that many calls serially over the network (even a local one). You need to either:
parallelize the calls, or use javax.cache.Cache#getAll to reduce the number of calls
try enabling reads from local backups via com.hazelcast.config.MapConfig#setReadBackupData so there are fewer requests over the network.
The read backup data feature is only available for IMap, so you would need to use Spring caching with the hazelcast-spring module and its com.hazelcast.spring.cache.HazelcastCacheManager:
@Bean
HazelcastCacheManager cacheManager(HazelcastInstance hazelcastInstance) {
return new HazelcastCacheManager(hazelcastInstance);
}
See documentation for more details.
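The suggestion to reduce the number of calls with getAll can be sketched as chunked bulk lookups. The sketch below uses only the standard library so it runs on its own: the bulkGet function is a placeholder, and the assumption is that in the real application you would access the cache directly (rather than through @CacheResult) and pass something like cache::getAll, since javax.cache.Cache#getAll takes a set of keys and returns a map. The class name and batch size are made up for the illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class BatchedLookup {
    /** Fetches all keys in chunks, calling bulkGet once per chunk instead of once per key. */
    static <K, V> Map<K, V> fetchInBatches(List<K> keys, int batchSize,
                                           Function<Set<K>, Map<K, V>> bulkGet) {
        Map<K, V> result = new HashMap<>();
        for (int from = 0; from < keys.size(); from += batchSize) {
            int to = Math.min(from + batchSize, keys.size());
            Set<K> chunk = new LinkedHashSet<>(keys.subList(from, to));
            result.putAll(bulkGet.apply(chunk)); // one round trip per chunk
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical in-memory stand-in for a distributed cache.
        Map<Integer, String> fakeCache = Map.of(1, "a", 2, "b", 3, "c", 4, "d", 5, "e");
        Function<Set<Integer>, Map<Integer, String>> bulkGet = chunk -> {
            Map<Integer, String> found = new HashMap<>();
            for (Integer key : chunk) {
                found.put(key, fakeCache.get(key));
            }
            return found;
        };
        System.out.println(fetchInBatches(List.of(1, 2, 3, 4, 5), 2, bulkGet));
    }
}
```

With 50,000 keys and a batch size of 1,000, for example, each cache would see 50 bulk calls instead of 50,000 single reads, so the per-call network latency is paid far less often.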

How to prevent data loss from Redis when the server is stopped forcefully, which results in RedisCommandInterruptedException

@Autowired
private StringRedisTemplate stringRedisTemplate;
public List<Object> getDataFromRedis(String redisKey) {
try {
long numberOfEntriesToRead = 60000;
return stringRedisTemplate.executePipelined(
(RedisConnection connection) -> {
StringRedisConnection stringRedisConn =(StringRedisConnection)connection;
for (int index = 0; index < numberOfEntriesToRead; index++) {
stringRedisConn.lPop(redisKey);
}
return null;
});
}catch (RedisCommandInterruptedException e) {
LOGGER.error("Interrupted EXCEPTION :::", e);
return Collections.emptyList(); // every path must return a value; needs java.util.Collections
}
}
}
I have a method that reads the Redis content for a given key. The problem is that when my application server is stopped while this method is fetching data from Redis, I get a RedisCommandInterruptedException, which results in the loss of some data from Redis. How can I overcome this problem? Any suggestions are appreciated.
Pipelines are not atomic operations, therefore there is no guarantee that all or none of the commands are executed when an exception happens.
You can use Lua scripts or the MULTI command to run the operations in a single transaction.
You can read more about using MULTI in Spring Data Redis in this SO thread and this site.

Spring data Page/Pageable returns duplicates on large data sets?

When operating on large data sets, Spring Data presents two abstractions: Stream and Page. We've been using Stream for a while and had no issues, but recently I wanted to try a paginated approach and ran into a reliability issue.
Consider the following:
@Entity
public class MyData {
}
public interface MyDataRepository extends JpaRepository<MyData, UUID> {
}
@Component
public class MyDataService {
private MyDataRepository repository;
// Bridge between a Reactive service and a transactional / non-reactive database call
@Transactional
public void getAllMyData(final FluxSink<MyData> sink) {
final Pageable firstPage = PageRequest.of(0, 500);
Page<MyData> page = repository.findAll(firstPage);
while (page != null && page.hasContent()) {
page.getContent().forEach(sink::next);
if (page.hasNext()) {
page = repository.findAll(page.nextPageable());
}
else {
page = null;
}
}
sink.complete();
}
}
Using two Postgres 9.5 databases, the source database had close to 100,000 rows while the destination was empty. The example code was then used to copy from the source to the destination. At the end I would find that my destination database had a far smaller row count than the source.
Run as a Spring Boot app
The flux doing the copy was using 4-6 threads in parallel (for speed)
Total run time of at least an hour (max was 2 hours)
As it turns out, I was eventually processing the same rows multiple times (and missing other rows as a result). This led me to a fix that others had already run into: you should provide a Sort.by("") argument.
After changing the service to use:
// Make our pages sorted by the PKEY
final Pageable firstPage = PageRequest.of(0, 500, Sort.by("id"));
I found that while it GREATLY helped, I would still process some rows multiple times (going from losing about half the rows to only seeing ~12 duplicates). When I use a Stream instead, I have no issues.
Does anyone have any explanation for what is going on? I don't seem to have any duplicates come through until the test has been running for at least 10-15 minutes, which almost leads me to believe that there is some kind of session or other timeout happening (either in the client or on the database) that causes the hiccups. But I'm really far out of my knowledge area for troubleshooting it further heh.

Non-Blocking Endpoint: Returning an operation ID to the caller - Would like to get your opinion on my implementation?

Boot Pros,
I recently started programming in Spring Boot and stumbled upon a question on which I would like to get your opinion.
What I try to achieve:
I created a Controller that exposes a GET endpoint named nonBlockingEndpoint. This nonBlockingEndpoint executes a fairly long, resource-heavy operation that can run between 20 and 40 seconds (in the attached code, it is mocked by a Thread.sleep()).
Whenever the nonBlockingEndpoint is called, the Spring application should register that call and immediately return an Operation ID to the caller.
The caller can then use this ID to query the status of this operation on another endpoint, queryOpStatus. At the beginning it will be started, and once the controller is done serving the request it will be set to a code such as SERVICE_OK. The caller then knows that the request was successfully completed on the server.
The solution that I found:
I have the following controller (note that it is explicitly not tagged with @Async)
It uses an APIOperationsManager to register that a new operation was started
I use the CompletableFuture Java construct to supply the long-running code as a new async process by using CompletableFuture.supplyAsync(() -> {})
I immediately return a response to the caller, telling it that the operation is in progress
Once the async task has finished, I use cf.thenRun() to update the operation status via the API Operations Manager
Here is the code:
@GetMapping(path="/nonBlockingEndpoint")
public @ResponseBody ResponseOperation nonBlocking() {
// Register a new operation
APIOperationsManager apiOpsManager = APIOperationsManager.getInstance();
final int operationID = apiOpsManager.registerNewOperation(Constants.OpStatus.PROCESSING);
ResponseOperation response = new ResponseOperation();
response.setMessage("Triggered non-blocking call, use the operation id to check status");
response.setOperationID(operationID);
response.setOpRes(Constants.OpStatus.PROCESSING);
CompletableFuture<Boolean> cf = CompletableFuture.supplyAsync(() -> {
try {
// Here we will
Thread.sleep(10000L);
} catch (InterruptedException e) {}
// whatever the return value was
return true;
});
cf.thenRun(() ->{
// We are done with the super long process, so update our Operations Manager
APIOperationsManager a = APIOperationsManager.getInstance();
boolean asyncSuccess = false;
try {asyncSuccess = cf.get();}
catch (Exception e) {}
if(true == asyncSuccess) {
a.updateOperationStatus(operationID, Constants.OpStatus.OK);
a.updateOperationMessage(operationID, "success: The long running process has finished and this is your result: SOME RESULT" );
}
else {
a.updateOperationStatus(operationID, Constants.OpStatus.INTERNAL_ERROR);
a.updateOperationMessage(operationID, "error: The long running process has failed.");
}
});
return response;
}
Here is also the APIOperationsManager.java for completeness:
public class APIOperationsManager {
private static APIOperationsManager instance = null;
private Vector<Operation> operations;
private int currentOperationId;
private static final Logger log = LoggerFactory.getLogger(Application.class);
protected APIOperationsManager() {}
public static APIOperationsManager getInstance() {
if(instance == null) {
synchronized(APIOperationsManager.class) {
if(instance == null) {
instance = new APIOperationsManager();
instance.operations = new Vector<Operation>();
instance.currentOperationId = 1;
}
}
}
return instance;
}
public synchronized int registerNewOperation(OpStatus status) {
cleanOperationsList();
currentOperationId = currentOperationId + 1;
Operation newOperation = new Operation(currentOperationId, status);
operations.add(newOperation);
log.info("Registered new Operation to watch: " + newOperation.toString());
return newOperation.getId();
}
public synchronized Operation getOperation(int id) {
for(Iterator<Operation> iterator = operations.iterator(); iterator.hasNext();) {
Operation op = iterator.next();
if(op.getId() == id) {
return op;
}
}
Operation notFound = new Operation(-1, OpStatus.INTERNAL_ERROR);
notFound.setCrated(null);
return notFound;
}
public synchronized void updateOperationStatus (int id, OpStatus newStatus) {
iteration : for(Iterator<Operation> iterator = operations.iterator(); iterator.hasNext();) {
Operation op = iterator.next();
if(op.getId() == id) {
op.setStatus(newStatus);
log.info("Updated Operation status: " + op.toString());
break iteration;
}
}
}
public synchronized void updateOperationMessage (int id, String message) {
iteration : for(Iterator<Operation> iterator = operations.iterator(); iterator.hasNext();) {
Operation op = iterator.next();
if(op.getId() == id) {
op.setMessage(message);
log.info("Updated Operation status: " + op.toString());
break iteration;
}
}
}
private synchronized void cleanOperationsList() {
Date now = new Date();
for(Iterator<Operation> iterator = operations.iterator(); iterator.hasNext();) {
Operation op = iterator.next();
if((now.getTime() - op.getCrated().getTime()) >= Constants.MIN_HOLD_DURATION_OPERATIONS ) {
log.info("Removed operation from watchlist: " + op.toString());
iterator.remove();
}
}
}
}
The questions that I have:
Is that concept a valid one that also scales? What could be improved?
Will I run into concurrency issues / race conditions?
Is there a better way to achieve the same in Spring Boot that I just haven't found yet? (maybe with the @Async annotation?)
I would be very happy to get your feedback.
Thank you so much,
Peter P
It is a valid pattern to submit a long-running task with one request and return an id that allows the client to ask for the result later.
But there are some things I would suggest reconsidering:
Do not use an Integer as the id, as it allows an attacker to guess ids and get the results for those ids. Use a random UUID instead.
If you need to restart your application, all ids and their results will be lost. You should persist them to a database.
Your solution will not work in a cluster with many instances of your application, as each instance would only know its 'own' ids and results. This could also be solved by persisting them to a database or a Redis store.
The way you are using CompletableFuture gives you no control over the number of threads used for the asynchronous operation. It is possible to do this with standard Java, but I would suggest using Spring to configure the thread pool.
Annotating the controller method with @Async is not an option; this does not work at all. Instead, put all asynchronous operations into a simple service and annotate it with @Async. This has some advantages:
You can also use this service synchronously, which makes testing a lot easier
You can configure the thread pool with Spring
The /nonBlockingEndpoint should not return the id, but a complete link to queryOpStatus, including the id. The client can then use this link directly without any additional information.
Additionally, there are some low-level implementation issues which you may also want to change:
Do not use Vector, it synchronizes on every operation. Use a List instead. Iterating over a List is also much easier; you can use for-loops or streams.
If you need to look up a value, do not iterate over a Vector or List; use a Map instead.
APIOperationsManager is a singleton. That makes no sense in a Spring application. Make it a normal POJO and create a bean of it, then get it autowired into the controller. Spring beans are singletons by default.
You should avoid doing complicated operations in a controller method. Instead, move everything into a service (which may be annotated with @Async). This makes testing easier, as you can test this service without a web context.
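Several of the points above (random UUID ids, a Map instead of iterating a Vector, and a bounded thread pool) can be sketched with plain Java. This is only an illustration under stated assumptions: the class OperationRegistry and its methods are made up, state is kept in memory only (so it still has the restart and cluster limitations mentioned above), and in a Spring application the pool would more likely be a configured executor bean used by an @Async service.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class OperationRegistry {
    enum Status { PROCESSING, OK, INTERNAL_ERROR }

    // Direct lookup by id; ConcurrentHashMap makes single-key updates thread-safe.
    private final Map<UUID, Status> statuses = new ConcurrentHashMap<>();
    // Fixed-size pool, so the number of concurrent long-running tasks is bounded.
    private final ExecutorService pool;

    OperationRegistry(int threads) {
        this.pool = Executors.newFixedThreadPool(threads);
    }

    /** Registers the task, starts it on the pool, and returns its id immediately. */
    UUID submit(Supplier<Boolean> task) {
        UUID id = UUID.randomUUID(); // not guessable, unlike a counter
        statuses.put(id, Status.PROCESSING);
        CompletableFuture.supplyAsync(task, pool)
                // A thrown exception or a false result both count as failure.
                .handle((ok, error) -> error == null && Boolean.TRUE.equals(ok)
                        ? Status.OK : Status.INTERNAL_ERROR)
                .thenAccept(status -> statuses.put(id, status));
        return id;
    }

    /** Returns the current status, or null for an unknown id. */
    Status status(UUID id) {
        return statuses.get(id);
    }

    /** Stops accepting tasks and waits for running ones; returns false on timeout or interrupt. */
    boolean shutdownAndWait(long seconds) {
        pool.shutdown();
        try {
            return pool.awaitTermination(seconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A controller would call submit(...) and return the id (or a link containing it) right away, and the status endpoint would call status(id). Unlike the empty catch blocks in the question, a failing task here ends up as INTERNAL_ERROR rather than being silently ignored.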
Hope this helps.
Do I need to make database access transactional?
As long as you write/update only one row, there is no need to make this transactional, as this is indeed 'atomic'.
If you write/update many rows at once, you should make it transactional to guarantee that either all rows are updated or none.
However, if two operations (maybe from two clients) update the same row, the last one will always win.
