Batch update operation using JDBC Template - spring-boot

I have a service where I have to update multiple rows. I was testing for a batch of 2000 rows. Using CrudRepository's saveAll() the update operation was taking 211 seconds.
After looking around for jdbc template I came across this implementation of it: https://mkyong.com/spring/spring-jdbctemplate-batchupdate-example/
My Implementation of it:
#Transactional
public int[][] batchUpdateBseStatus(List<ExchangeTradeStatus> users, int batchSize) {
int[][] updateCounts = jdbcTemplate.batchUpdate(
"update exchange_trade_status set bse_status = ? where id = ?",
users,
batchSize,
new ParameterizedPreparedStatementSetter<ExchangeTradeStatus>() {
public void setValues(PreparedStatement ps, ExchangeTradeStatus user)
throws SQLException {
ps.setString(1, user.getBseStatus().name());
ps.setInt(2, user.getId());
}
});
return updateCounts;
}
For the same update process it's now taking about 105 seconds. Reading more about implementing jdbc batch update I saw a similar implementation to mine who had published this performance:
My time is pretty slow compared to this. Is there any fundamental flaw in my understanding and final implementation of batchUpdate function and how can I improve my time?
Update:
I used these two properties and it gave me an update time of 1.297 seconds for 1970 rows
spring.datasource.hikari.data-source-properties.useConfigs=maxPerformance
spring.datasource.hikari.data-source-properties.rewriteBatchedStatements=true

Related

Benchmarking spring data vs JDBI in select from postgres Database

I wanted to compare the performence for Spring data vs JDBI
I used the following versions
Spring Boot 2.2.4.RELEASE
vs
JDBI 3.13.0
the test is fairly simple select * from admin table and convert to a list of Admin object
here is the relevant details
with spring boot
public interface AdminService extends JpaRepository<Admin, Integer> {
}
and for JDBI
public List<Admin> getAdmins() {
String sql = "Select admin_id as adminId, username from admins";
Handle handle = null;
try {
handle = Sql2oConnection.getInstance().getJdbi().open();
return handle.createQuery(sql).mapToBean(Admin.class).list();
}catch(Exception ex) {
log.error("Could not select admins from admins: {}", ex.getMessage(), ex );
return null;
} finally {
handle.close();
}
}
the test class is executed using junit 5
#Test
#DisplayName("How long does it take to run 1000 queries")
public void loadAdminTable() {
System.out.println("Running load test");
Instant start = Instant.now();
for(int i= 0;i<1000;i++) {
adminService.getAdmins(); // for spring its findAll()
for(Admin admin: admins) {
if(admin.getAdminId() == 654) {
System.out.println("just to simulate work with the data");
}
}
}
Instant end = Instant.now();
Duration duration = Duration.between(start, end);
System.out.println("Total duration: " + duration.getSeconds());
}
i was quite shocked to get the following results
Spring Data: 2 seconds
JDBI: 59 seconds
any idea why i got these results? i was expecting JDBI to be faster
The issue was that spring manages the connection life cycle for us and for a good reason
after reading the docs of JDBI
There is a performance penalty every time a connection is allocated
and released. In the example above, the two insertFullContact
operations take separate Connection objects from your database
connection pool.
i changed the test code of the JDBI test to the following
#Test
#DisplayName("How long does it take to run 1000 queries")
public void loadAdminTable() {
System.out.println("Running load test");
String sql = "Select admin_id as adminId, username from admins";
Handle handle = null;
handle = Sql2oConnection.getInstance().getJdbi().open();
Instant start = Instant.now();
for(int i= 0;i<1000;i++) {
List<Admin> admins = handle.createQuery(sql).mapToBean(Admin.class).list();
if(!admins.isEmpty()) {
for(Admin admin: admins) {
System.out.println(admin.getUsername());
}
}
}
handle.close();
Instant end = Instant.now();
Duration duration = Duration.between(start, end);
System.out.println("Total duration: " + duration.getSeconds());
}
this way the connection is opened once and the query runs 1000 times
the final result was 1 second
twice as fast as spring
On the one hand you seem to make some basic mistakes of benchmarking:
You are not warming up the JVM.
You are not using the results in any way.
Therefore what you are seeing might just be effects of different optimisations of the VM.
Look into JMH in order to improve your benchmarks.
Benchmarks with an external resource are extra hard, because you have so many more parameters to control.
One big question is for example if the connection to the database is realistically slow as in most production systems the database will be on a different machine at least virtually, quite possibly on different hardware.
Is that true in your test as well?
Assuming your results are real, the next step is to investigate where the extra time gets spent.
I would expect the most time to be spent with executing the SQL statements and obtaining the result via the network.
Therefore you should inspect what SQL statements actually get executed.
This might point you to one possible answer that JPA is doing lots of lazy loading and hasn't even loaded most of you really need.

Spring data Page/Pageable returns duplicates on large data sets?

When operating on large data sets, Spring Data presents two abstractions: Stream and Page. We've been using Stream for awhile and had no issues, but recently I wanted to try a paginated approach and ran into a reliability issue.
Consider the following:
#Entity
public class MyData {
}
public interface MyDataRepository extends JpaRepository<MyData, UUID> {
}
#Component
public class MyDataService {
private MyDataRepository repository;
// Bridge between a Reactive service and a transactional / non-reactive database call
#Transactional
public void getAllMyData(final FluxSink<MyData> sink) {
final Pageable firstPage = PageRequest.of(0, 500);
Page<MyData> page = repository.findAll(firstPage);
while (page != null && page.hasContent()) {
page.getContent().forEach(sink::next);
if (page.hasNext()) {
page = repository.findAll(page.nextPageable());
}
else {
page = null;
}
}
sink.complete();
}
}
Using two Postgres 9.5 databases, the source database had close to 100,000 rows while the destination was empty. The example code was then used to copy from the source to the destination. At the end I would find that my destination database had far smaller row count than the source.
Run as a springboot app
The flux doing the copy was using 4-6 threads in parallel (for speed)
Total run time of at least an hour (max was 2 hours)
As it turns out, I was eventually processing the same rows multiple times (and missing other rows as a result). This lead me to discovering a fix that others had already ran into, where you should provide a Sort.by("") argument.
After changing the service to use:
// Make our pages sorted by the PKEY
final Pageable firstPage = PageRequest.of(0, 500, Sort.by("id"));
I found that while it GREATLY helped, I would still process multiple rows (from losing about half the rows to only seeing ~12 duplicates). When I use a Stream instead, I have no issues.
Does anyone have any explanation for what is going on? I don't seem to have any duplicates come through until the test has been running for at least 10-15min, which almost leads me to believe that there is some kind of session or other timeout happening (either in the client, or on the database) that causes the hiccups. But I'm really far out of my knowledge area for troubleshooting it further heh.

Spring Data JPA projection performance

This projection:
public interface IDate {
UUID getId();
Long getLatestTime();
default DateTime getLatestDate() {
Long maximumTimeLastModified = getLatestTime();
Date maxDate = new Date(maximumTimeLastModified.longValue());
return new DateTime(maxDate);
}
}
was created and added to the JPA Repository:
List<IDate> findLatestDates(Set<UUID> ids);
Functionally this works perfectly and is very clean. However, the performance was slow - it took nearly twice as long as simply returning List<Object[]>. (And processing those results in Java). Specifically, a web request took 12 seconds to complete with the projection use, but only 7 seconds without it. Does anyone know why and if there is a way to improve? In general, are there known performance impacts of using Projections that all should be aware of?

LeaseExpiredException with custom UDF in Hive

I have a Hive UDF which is supposed to extract the device from an UA string. It uses the ua-parser library:
https://github.com/tobie/ua-parser
The UDF is rather simple:
public class DeviceTypeExtractTest extends UDF{
private Text result = new Text();
private static final Parser uaParser;
static {
try {
uaParser = new Parser();
}
catch(IOException e) {
throw new RuntimeException("Could not instantiate User-Agent parser.");
}
}
public Text evaluate( Text uaField){
if (uaField == null ) {
return null;
}
try
{
String uaString = uaField.toString();
Client client = uaParser.parse(uaString);
result.set(client.device.family);
return result;
}
catch(Exception e)
{
return null;
}
}
}
And it works just fine when run on a small dataset.
create table categories(
cat string);
insert overwrite table categories select DEVICE_TYPE_EXTRACT(user_agent) from raw_logs;
However, when testing this on a larger dataset of over 10 million rows, I get this LeaseExpiredException on every attempt:
http://pastebin.com/yK6Qmx6r
And my map and reduce processes remain stuck at 0% for hours. Note that if I take out this udf and use some internal Hive UDFs just for testing, this behavior does not take place.
I am running this on an Amazon EMR cluster with AMI version 2.4.5 (Hive 0.11.0.2 and Hadoop 1.0.3).
I tried increasing the performance of the cluster by deploying better hardware, but I get the same problem with any hardware scenario.
Any ideas?
Okay, scratch that. It seems that after upgrading my instance, things started to move around but I was just not waiting long enough for the mapping to happen. And the LeaseExpiredError was actually thrown because of little ol' me when I was killing the processes.
Still, the parsing is taking an immense amount of time and I would love some suggestions to further optimize this UDF.

Non-Blocking Endpoint: Returning an operation ID to the caller - Would like to get your opinion on my implementation?

Boot Pros,
I recently started to program in spring-boot and I stumbled upon a question where I would like to get your opinion on.
What I try to achieve:
I created a Controller that exposes a GET endpoint, named nonBlockingEndpoint. This nonBlockingEndpoint executes a pretty long operation that is resource heavy and can run between 20 and 40 seconds.(in the attached code, it is mocked by a Thread.sleep())
Whenever the nonBlockingEndpoint is called, the spring application should register that call and immediatelly return an Operation ID to the caller.
The caller can then use this ID to query on another endpoint queryOpStatus the status of this operation. At the beginning it will be started, and once the controller is done serving the reuqest it will be to a code such as SERVICE_OK. The caller then knows that his request was successfully completed on the server.
The solution that I found:
I have the following controller (note that it is explicitely not tagged with #Async)
It uses an APIOperationsManager to register that a new operation was started
I use the CompletableFuture java construct to supply the long running code as a new asynch process by using CompletableFuture.supplyAsync(() -> {}
I immdiatelly return a response to the caller, telling that the operation is in progress
Once the Async Task has finished, i use cf.thenRun() to update the Operation status via the API Operations Manager
Here is the code:
#GetMapping(path="/nonBlockingEndpoint")
public #ResponseBody ResponseOperation nonBlocking() {
// Register a new operation
APIOperationsManager apiOpsManager = APIOperationsManager.getInstance();
final int operationID = apiOpsManager.registerNewOperation(Constants.OpStatus.PROCESSING);
ResponseOperation response = new ResponseOperation();
response.setMessage("Triggered non-blocking call, use the operation id to check status");
response.setOperationID(operationID);
response.setOpRes(Constants.OpStatus.PROCESSING);
CompletableFuture<Boolean> cf = CompletableFuture.supplyAsync(() -> {
try {
// Here we will
Thread.sleep(10000L);
} catch (InterruptedException e) {}
// whatever the return value was
return true;
});
cf.thenRun(() ->{
// We are done with the super long process, so update our Operations Manager
APIOperationsManager a = APIOperationsManager.getInstance();
boolean asyncSuccess = false;
try {asyncSuccess = cf.get();}
catch (Exception e) {}
if(true == asyncSuccess) {
a.updateOperationStatus(operationID, Constants.OpStatus.OK);
a.updateOperationMessage(operationID, "success: The long running process has finished and this is your result: SOME RESULT" );
}
else {
a.updateOperationStatus(operationID, Constants.OpStatus.INTERNAL_ERROR);
a.updateOperationMessage(operationID, "error: The long running process has failed.");
}
});
return response;
}
Here is also the APIOperationsManager.java for completness:
public class APIOperationsManager {
private static APIOperationsManager instance = null;
private Vector<Operation> operations;
private int currentOperationId;
private static final Logger log = LoggerFactory.getLogger(Application.class);
protected APIOperationsManager() {}
public static APIOperationsManager getInstance() {
if(instance == null) {
synchronized(APIOperationsManager.class) {
if(instance == null) {
instance = new APIOperationsManager();
instance.operations = new Vector<Operation>();
instance.currentOperationId = 1;
}
}
}
return instance;
}
public synchronized int registerNewOperation(OpStatus status) {
cleanOperationsList();
currentOperationId = currentOperationId + 1;
Operation newOperation = new Operation(currentOperationId, status);
operations.add(newOperation);
log.info("Registered new Operation to watch: " + newOperation.toString());
return newOperation.getId();
}
public synchronized Operation getOperation(int id) {
for(Iterator<Operation> iterator = operations.iterator(); iterator.hasNext();) {
Operation op = iterator.next();
if(op.getId() == id) {
return op;
}
}
Operation notFound = new Operation(-1, OpStatus.INTERNAL_ERROR);
notFound.setCrated(null);
return notFound;
}
public synchronized void updateOperationStatus (int id, OpStatus newStatus) {
iteration : for(Iterator<Operation> iterator = operations.iterator(); iterator.hasNext();) {
Operation op = iterator.next();
if(op.getId() == id) {
op.setStatus(newStatus);
log.info("Updated Operation status: " + op.toString());
break iteration;
}
}
}
public synchronized void updateOperationMessage (int id, String message) {
iteration : for(Iterator<Operation> iterator = operations.iterator(); iterator.hasNext();) {
Operation op = iterator.next();
if(op.getId() == id) {
op.setMessage(message);
log.info("Updated Operation status: " + op.toString());
break iteration;
}
}
}
private synchronized void cleanOperationsList() {
Date now = new Date();
for(Iterator<Operation> iterator = operations.iterator(); iterator.hasNext();) {
Operation op = iterator.next();
if((now.getTime() - op.getCrated().getTime()) >= Constants.MIN_HOLD_DURATION_OPERATIONS ) {
log.info("Removed operation from watchlist: " + op.toString());
iterator.remove();
}
}
}
}
The questions that I have
Is that concept a valid one that also scales? What could be improved?
Will i run into concurrency issues / race conditions?
Is there a better way to achieve the same in boot spring, but I just didn't find that yet? (maybe with the #Async directive?)
I would be very happy to get your feedback.
Thank you so much,
Peter P
It is a valid pattern to submit a long running task with one request, returning an id that allows the client to ask for the result later.
But there are some things I would suggest to reconsider :
do not use an Integer as id, as it allows an attacker to guess ids and to get the results for those ids. Instead use a random UUID.
if you need to restart your application, all ids and their results will be lost. You should persist them to a database.
Your solution will not work in a cluster with many instances of your application, as each instance would only know its 'own' ids and results. This could also be solved by persisting them to a database or Reddis store.
The way you are using CompletableFuture gives you no control over the number of threads used for the asynchronous operation. It is possible to do this with standard Java, but I would suggest to use Spring to configure the thread pool
Annotating the controller method with #Async is not an option, this does not work no way. Instead put all asynchronous operations into a simple service and annotate this with #Async. This has some advantages :
You can use this service also synchronously, which makes testing a lot easier
You can configure the thread pool with Spring
The /nonBlockingEndpoint should not return the id, but a complete link to the queryOpStatus, including id. The client than can directly use this link without any additional information.
Additionally there are some low level implementation issues which you may also want to change :
Do not use Vector, it synchronizes on every operation. Use a List instead. Iterating over a List is also much easier, you can use for-loops or streams.
If you need to lookup a value, do not iterate over a Vector or List, use a Map instead.
APIOperationsManager is a singleton. That makes no sense in a Spring application. Make it a normal PoJo and create a bean of it, get it autowired into the controller. Spring beans by default are singletons.
You should avoid to do complicated operations in a controller method. Instead move anything into a service (which may be annotated with #Async). This makes testing easier, as you can test this service without a web context
Hope this helps.
Do I need to make database access transactional ?
As long as you write/update only one row, there is no need to make this transactional as this is indeed 'atomic'.
If you write/update many rows at once you should make it transactional to guarantee, that either all rows are updated or none.
However, if two operations (may be from two clients) update the same row, always the last one will win.

Resources