Trying to optimise one of the big queries in our GAE/J datastore, and during the experiment we found some results we did not expect.
The code itself is pretty straightforward; I've copied it below:
public static CursorEntityWrapper<Media> getMediasByUsernameCursored(String username, String cursorString, int batchSize, MediaFetchGroup... fetchGroups) {
    PersistenceManager pm = PMF.get().getPersistenceManager();
    List<Media> result = new ArrayList<>();
    Transaction tx = pm.currentTransaction();
    try {
        DBUtils.applyFetchGroups(pm, fetchGroups);
        tx.begin();
        Query query = pm.newQuery(Media.class);
        query.getFetchPlan().setFetchSize(batchSize);
        // Resume from the previous page when a cursor was passed in.
        if (StringUtils.isNotBlank(cursorString)) {
            Cursor c = Cursor.fromWebSafeString(cursorString);
            Map<String, Object> extensionMap = new HashMap<>();
            extensionMap.put(JDOCursorHelper.CURSOR_EXTENSION, c);
            query.setExtensions(extensionMap);
        }
        query.setFilter("mediaUserString == userParam");
        query.declareParameters("java.lang.String userParam");
        query.setRange(0, batchSize);
        result = (List<Media>) query.execute(username);
        // Remember the cursor so the next call can continue where this one stopped.
        Cursor cursor = JDOCursorHelper.getCursor(result);
        cursorString = cursor.toWebSafeString();
        tx.commit();
    } finally {
        DBUtils.closeTransaction(tx, log);
        DBUtils.closePMAndDetachResult(pm, result);
    }
    // Assumption: CursorEntityWrapper pairs the page of results with its cursor string.
    return new CursorEntityWrapper<>(result, cursorString);
}
As you can see, we have a query method that returns all Media entities for a given username. This entity has some child entities whose fetch groups you can pass in as optional parameters.
We get very poor performance from this query when asking for one of the large fetch groups. For a result of 60 entities it takes about 10-15 seconds to return, and when the result size is 200 the response can take up to 40 seconds.
One interesting experiment: we thought the transaction should have a major impact on performance, so we removed the transaction from the method and redeployed the application. But it seems we were wrong; the method became even slower when the transaction was not present.
So my question is: why would a query without a transaction be slower than one with a transaction?
The goal of my Spring Boot WebFlux R2DBC application is: the controller accepts a request containing a list of DB UPDATE or INSERT details, and responds with a result summary.
I can write a ReactiveCrudRepository-based repository to implement each DB operation, but I don't know how to write the service that groups the execution of the list of DB operations and composes a result summary response.
I am new to Java reactive programming. Thanks for any suggestions and help.
Chen
I got the hint from here: https://www.vinsguru.com/spring-webflux-aggregation/ . The ideas are:
From the request, create 3 Monos:
Mono<List<Long>> monoEndDateSet -- DB row ids of the update operations;
Mono<List<Long>> monoCreateList -- DB row ids of the newly inserted rows;
Mono<ChangeSupplyResponse> monoRespFilled -- the response with some known fields partly filled in;
Use Mono.zip to aggregate the 3 Monos, then map the resulting Tuple3 into the Mono<ChangeSupplyResponse> to return.
Below are the key parts of the code:
public Mono<ChangeSupplyResponse> ChangeSupplies(ChangeSupplyRequest csr) {
    // Pre-fill the response with the fields already known from the request.
    ChangeSupplyResponse resp = ChangeSupplyResponse.builder().build();
    resp.setEventType(csr.getEventType());
    resp.setSupplyOperationId(csr.getSupplyOperationId());
    resp.setTeamMemberId(csr.getTeamMemberId());
    resp.setRequestTimeStamp(csr.getTimestamp());
    resp.setProcessStart(OffsetDateTime.now());
    resp.setUserId(csr.getUserId());

    Mono<List<Long>> monoEndDateSet = getEndDateIdList(csr);
    Mono<List<Long>> monoCreateList = getNewSupplyEntityList(csr);
    Mono<ChangeSupplyResponse> monoRespFilled = Mono.just(resp);

    // Zip the three Monos, combine them into the final response, and make the whole pipeline transactional.
    return Mono.zip(monoRespFilled, monoEndDateSet, monoCreateList).map(this::combine).as(operator::transactional);
}
private ChangeSupplyResponse combine(Tuple3<ChangeSupplyResponse, List<Long>, List<Long>> tuple) {
    ChangeSupplyResponse resp = tuple.getT1().toBuilder().build();
    List<Long> endDateIds = tuple.getT2();
    resp.setEndDatedDemandStreamSupplyIds(endDateIds);
    List<Long> newIds = tuple.getT3();
    resp.setNewCreatedDemandStreamSupplyIds(newIds);
    resp.setSuccess(true);
    Duration span = Duration.between(resp.getProcessStart(), OffsetDateTime.now());
    resp.setProcessDurationMillis(span.toMillis());
    return resp;
}
private Mono<List<Long>> getNewSupplyEntityList(ChangeSupplyRequest csr) {
    Flux<DemandStreamSupplyEntity> fluxNewCreated = Flux.empty();
    for (SrmOperation so : csr.getOperations()) {
        if (so.getType() == SrmOperationType.createSupply) {
            DemandStreamSupplyEntity e = buildEntity(so, csr);
            fluxNewCreated = fluxNewCreated.mergeWith(this.demandStreamSupplyRepository.save(e));
        }
    }
    return fluxNewCreated.map(e -> e.getDemandStreamSupplyId()).collectList();
}
...
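As a side note, the loop that reassigns fluxNewCreated works, but the same method could also be written as a single declarative pipeline; this is just a rough sketch, assuming the same repository, enum, and buildEntity helper as above:

private Mono<List<Long>> getNewSupplyEntityList(ChangeSupplyRequest csr) {
    return Flux.fromIterable(csr.getOperations())
            .filter(so -> so.getType() == SrmOperationType.createSupply)     // only the create operations
            .map(so -> buildEntity(so, csr))                                 // build the entity to persist
            .flatMap(demandStreamSupplyRepository::save)                     // save returns a Mono, flatMap flattens it
            .map(DemandStreamSupplyEntity::getDemandStreamSupplyId)          // keep only the generated ids
            .collectList();
}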
I implemented pageable functionality in a Criteria API query and noticed increased memory usage during query execution. I also used a spring-data-jpa method query to return the same result, and there the memory is cleaned up after every batch is processed. I tried detaching, flushing, and clearing objects from the EntityManager, but memory use kept going up; occasionally it drops, but not as much as with method queries. My question is: what could cause this memory use if objects are detached, and how do I deal with it?
Memory usage with Criteria API pageable:
Memory usage with method query:
Code
Since I'm also updating entities retrieved from the DB, I use an approach where I save the ID of the last processed entity, so that when an entity gets updated the query doesn't skip the next selected page. Below I provide a code example that is not from the real app I'm working on; it is just a recreation of the issue I'm having.
Repository code:
@Override
public Slice<Player> getPlayers(int lastId, Pageable pageable) {
    List<Predicate> predicates = new ArrayList<>();
    CriteriaBuilder criteriaBuilder = entityManager.getCriteriaBuilder();
    CriteriaQuery<Player> criteriaQuery = criteriaBuilder.createQuery(Player.class);
    Root<Player> root = criteriaQuery.from(Player.class);
    predicates.add(criteriaBuilder.greaterThan(root.get("id"), lastId));
    criteriaQuery.where(criteriaBuilder.and(predicates.toArray(Predicate[]::new)));
    criteriaQuery.orderBy(criteriaBuilder.asc(root.get("id")));
    var query = entityManager.createQuery(criteriaQuery);
    if (pageable.isPaged()) {
        int pageSize = pageable.getPageSize();
        int offset = pageable.getPageNumber() > 0 ? pageable.getPageNumber() * pageSize : 0;
        // Fetch one additional element and drop it based on the pageSize to know the hasNext value.
        query.setMaxResults(pageSize + 1);
        query.setFirstResult(offset);
        var resultList = query.getResultList();
        boolean hasNext = pageable.isPaged() && resultList.size() > pageSize;
        return new SliceImpl<>(hasNext ? resultList.subList(0, pageSize) : resultList, pageable, hasNext);
    } else {
        return new SliceImpl<>(query.getResultList(), pageable, false);
    }
}
Iterating through pageables:
@Override
public Slice<Player> getAllPlayersPageable() {
    int lastId = 0;
    boolean hasNext = false;
    Pageable pageable = PageRequest.of(0, 200);
    do {
        var players = playerCriteriaRepository.getPlayers(lastId, pageable);
        if (!players.isEmpty()) {
            lastId = players.getContent().get(players.getContent().size() - 1).getId();
            for (var player : players) {
                System.out.println(player.getFirstName());
                entityManager.detach(player);
            }
        }
        hasNext = players.hasNext();
    } while (hasNext);
    return null;
}
I think you are running into a query plan cache issue here that is related to the use of the JPA Criteria API and how numeric values are handled. Hibernate will render all numeric values as literals into an intermediary HQL query string which is then compiled. As you can imagine, every "scroll" to the next page will be a new query string so you gradually fill up the query plan cache.
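If you want to stay on the plain JPA Criteria API, one common workaround is to bind the changing value through a ParameterExpression instead of passing it as a value, so every page compiles to the same query string. This is only a rough sketch reusing the names from the question, not part of the original answer; depending on your Hibernate version, the hibernate.criteria.literal_handling_mode=BIND setting can have a similar effect:

// Bind lastId as a JPA parameter instead of letting Hibernate render it as a literal,
// so each page reuses the same cached query plan.
CriteriaBuilder cb = entityManager.getCriteriaBuilder();
CriteriaQuery<Player> cq = cb.createQuery(Player.class);
Root<Player> root = cq.from(Player.class);
ParameterExpression<Integer> lastIdParam = cb.parameter(Integer.class, "lastId");
cq.where(cb.greaterThan(root.get("id"), lastIdParam));
cq.orderBy(cb.asc(root.get("id")));
var query = entityManager.createQuery(cq);
query.setParameter("lastId", lastId);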
One possible solution is to use a library like Blaze-Persistence which has a custom JPA Criteria API implementation and a Spring Data integration that will avoid these issues and at the same time improve the performance of your queries due to a better pagination implementation.
All your code would stay the same; you just have to include the integration and configure it as documented in the setup section.
During code optimization I found a few areas where I was using findOne() within a for loop:
public List<User> validateUsers(List<String> userIds) {
    List<User> validUsers = new ArrayList<>();
    for (String userId : userIds) {
        User user = userRepository.findOne(userId); // Network hit :: expensive call
        // Perform validations
        ...
        // Add valid users to validUsers list
        ...
    }
    return validUsers;
}
The above method takes a long time if I pass a huge list of users to validate (around 5 seconds for 300 users).
Then I changed the method to use findAll() and perform the validations on the result collection:
public List<User> validateUsers(List<String> userIds) {
    List<User> validUsers = new ArrayList<>();
    Iterable<User> itr = userRepository.findAll(userIds); // Only one network hit
    for (User user : itr) {
        // Perform validations
        ...
        // Add valid users to validUsers list
        ...
    }
    return validUsers;
}
Now, for 300 users, the results come back in about 100 ms.
The question is: are there any side effects of using findAll(), considering the underlying structure of Cassandra? Also, I am using CrudRepository; should I use CassandraRepository instead?
The following are the parameters to think about when you are attempting this:
How big the users table is, if you are using findAll.
The partition keys of the users table.
As Cassandra queries are faster on primary key fields, findOne might perform better with a large amount of data.
However, you can try
List<T> findAllById(Iterable<ID> ids);
from org.springframework.data.cassandra.repository.CassandraRepository, as sketched below.
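A rough sketch of what that could look like; the repository name is illustrative and the assumption that User's ID type is String comes from the question, not from your actual code:

// Hypothetical repository; CassandraRepository exposes findAllById(Iterable<ID>).
public interface UserRepository extends CassandraRepository<User, String> {
}

// Fetch the whole batch in one call instead of one findOne per id.
List<User> users = userRepository.findAllById(userIds);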
I'm working on a process that checks and updates data from an Oracle database. I'm using Hibernate and the Spring framework in my application.
The application reads a CSV file, processes the content, then persists the entities:
public class Main {
    public static void main(String[] args) {
        Input input = ReadCSV(path);
        EntityList resultList = Process.process(input);
        WriteResult.write(resultList);
        ...
    }
}
// Process class that loops over the input
public class Process {
    public EntityList process(Input input) {
        EntityList results = ...;
        ...
        for (Line line : input.readLine()) {
            results.add(ProcessLine.process(line));
            ...
        }
        return results;
    }
}
// Retrieving and updating entities
class ProcessLine {
    @Autowired
    DomaineRepository domaineRepository;
    @Autowired
    CompanyDomaineService companydomaineService;

    @Transactional
    public MyEntity process(Line line) {
        // getCompanyByXX is a CrudRepository method with @Query that returns an entity object
        MyEntity companyToAttach = domaineRepository.getCompanyByCode(line.getCode());
        MyEntity companyToDetach = domaineRepository.getCompanyBySiret(line.getSiret());
        if (companyToDetach == null || companyToAttach == null) {
            throw new CustomException("Custom Exception");
        }
        // attachCompany retrieves some relationEntity, then removes companyToDetach and adds
        // companyToAttach; this updates the relationEntity.company attribute.
        companydomaineService.attachCompany(companyToAttach, companyToDetach);
        return companyToAttach;
    }
}
public class WriteResult {
    @Autowired
    DomaineRepository domaineRepository;

    @Transactional
    public void write(EntityList results) {
        for (MyEntity result : results) {
            domaineRepository.save(result);
        }
    }
}
The application works well on files with few lines, but when I try to process large files (200,000 lines), the performance slows drastically and I get a SQL timeout.
I suspect cache issues, but I'm wondering if saving all the entities at the end of the processing isn't a bad practice?
The problem is your for loop, which is doing individual saves on the results and thus issues single inserts, slowing it down. Hibernate and Spring support batch inserts, and they should be used whenever possible.
Something like domaineRepository.saveAll(results)
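Note that for saveAll to actually produce JDBC batches, Hibernate's statement batching generally has to be enabled; a rough sketch of the usual Spring Boot properties (the batch size of 50 is just an example value):

spring.jpa.properties.hibernate.jdbc.batch_size=50
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true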
Since you are processing a lot of data it might be better to do things in batches, so instead of getting one company to attach at a time, you should get a list of companies to attach and process those, then get a list of companies to detach and process those:
public EntityList process(Input input) {
    EntityList results;
    List<Code> companiesToAdd = new ArrayList<>();
    List<Siret> companiesToRemove = new ArrayList<>();
    for (Line line : input.readLine()) {
        companiesToAdd.add(line.getCode());
        companiesToRemove.add(line.getSiret());
        ...
    }
    results = process(companiesToAdd, companiesToRemove);
    return results;
}
public List<MyEntity> process(List<Code> companiesToAdd, List<Siret> companiesToRemove) {
    List<MyEntity> attachList = domaineRepository.getCompanyByCodeIn(companiesToAdd);
    List<MyEntity> detachList = domaineRepository.getCompanyBySiretIn(companiesToRemove);
    if (attachList.isEmpty() || detachList.isEmpty()) {
        throw new CustomException("Custom Exception");
    }
    companydomaineService.attachCompany(attachList, detachList);
    return attachList;
}
The above code is just pseudocode to point you in the right direction; you will need to work out what works for you.
For every line you read you are doing two read operations here:
MyEntity companyToAttach = domaineRepository.getCompanyByCode(line.getCode());
MyEntity companyToDetach = domaineRepository.getCompanyBySiret(line.getSiret());
You can read more than one line, use the in query, and then process that list of companies, as sketched below.
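A rough sketch of what those in queries could look like as Spring Data derived query methods; the repository base interface, ID type, and the assumption that MyEntity has code and siret properties of types Code and Siret are illustrative guesses, not taken from the real code:

public interface DomaineRepository extends JpaRepository<MyEntity, Long> {
    // Derived "in" queries: one SELECT per batch instead of one per CSV line.
    List<MyEntity> getCompanyByCodeIn(Collection<Code> codes);
    List<MyEntity> getCompanyBySiretIn(Collection<Siret> sirets);
}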
Is there any way I can get a ResultSet object from one of the JdbcTemplate query methods?
I have code like:
List<ResultSet> rsList = template.query(finalQuery, new RowMapper<ResultSet>() {
    public ResultSet mapRow(ResultSet rs, int rowNum) throws SQLException {
        return rs;
    }
});
I wanted to execute the SQL statement stored in the finalQuery String and get the ResultSet. The query is a complex join over 6 to 7 tables; I am selecting 4-5 columns from each table and want to get the metadata of those columns to transform data types and data for downstream systems.
If it were a simple query fetching from only one table, I could use RowMapper#mapRow and inside that mapRow method call ResultSetExtractor.extractData to get the list of results; but in this case I have complex joins in my query, and I am trying to get the ResultSet object and, from that ResultSet, the metadata...
The above code is not good because for each row it returns the same ResultSet object, and I don't want to store them in a list...
One more thing: if mapRow is called for each row of my query, will JdbcTemplate close the ResultSet and connection even though my list has references to the ResultSet object?
Is there any simple method like jdbcTemplate.queryForResultSet(sql)?
I have now implemented my own ResultSetExtractor to process and insert data into downstream systems:
sourceJdbcTemplate.query(finalQuery, new CustomResultSetProcessor(targetTable, targetJdbcTemplate));
This CustomResultSetProcessor implements ResultSetExtractor, and in its extractData method I call 3 different methods: the first gets the column types from rs.getMetaData(), the second gets the column types of the target table by running
SELECT NAME, COLTYPE, TBNAME FROM SYSIBM.SYSCOLUMNS WHERE TBNAME ='TABLENAME' AND TABCREATOR='TABLE CREATOR'
and the third builds the (prepared) insert statement from the target column types and finally executes it using
new BatchPreparedStatementSetter() {
    @Override
    public void setValues(PreparedStatement insertStmt, int i) throws SQLException {}
}
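For context, such a setter is typically handed to JdbcTemplate.batchUpdate together with the generated insert statement; this is only a hedged sketch, where insertSql and rows are hypothetical placeholders for the generated statement and the buffered data:

targetJdbcTemplate.batchUpdate(insertSql, new BatchPreparedStatementSetter() {
    @Override
    public void setValues(PreparedStatement ps, int i) throws SQLException {
        // bind the values of rows.get(i) to the prepared insert statement
    }

    @Override
    public int getBatchSize() {
        return rows.size(); // number of rows in this batch (required by the interface)
    }
});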
Hope this helps to others...
Note that the whole point of Spring's JdbcTemplate is that it automatically closes all resources, including the ResultSet, after execution of the callback method. Therefore it would be better to extract the necessary data inside the callback method and allow Spring to close the ResultSet afterwards.
If the result of the data extraction is not a List, you can use ResultSetExtractor instead of RowMapper:
SomeComplexResult r = template.query(finalQuery,
        new ResultSetExtractor<SomeComplexResult>() {
            public SomeComplexResult extractData(ResultSet rs) throws SQLException, DataAccessException {
                // do complex processing of the ResultSet and return its result as SomeComplexResult
            }
        });
Something like this would also work:
Connection con = DataSourceUtils.getConnection(dataSource); // your datasource
Statement s = con.createStatement();
ResultSet rs = s.executeQuery(query); // your query
ResultSetMetaData rsmd = rs.getMetaData();
Although I agree with @axtavt that ResultSetExtractor is preferred in a Spring environment, it does force you to execute the query.
The code below does not require you to do so, so that the client code is not required to provide the actual arguments for the query parameters:
public SomeResult getMetadata(String querySql) throws SQLException {
    Assert.hasText(querySql);
    DataSource ds = jdbcTemplate.getDataSource();
    Connection con = null;
    PreparedStatement ps = null;
    try {
        con = DataSourceUtils.getConnection(ds);
        ps = con.prepareStatement(querySql);
        ResultSetMetaData md = ps.getMetaData(); // <-- the query is compiled, but not executed
        return processMetadata(md);
    } finally {
        JdbcUtils.closeStatement(ps);
        DataSourceUtils.releaseConnection(con, ds);
    }
}
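A hedged usage example (the query text is just a placeholder); note that, per the JDBC contract, some drivers may return null from PreparedStatement.getMetaData() before the statement has been executed, so a null check is advisable:

// Inspect the column types of a join query without actually running it.
SomeResult meta = getMetadata("SELECT a.id, b.name FROM table_a a JOIN table_b b ON a.id = b.a_id");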