Taking a long time to read records and save them into another table using Java streams and Spring Boot JPA - spring-boot

From the code below, I want to read the details from the PersonInfoEntity table and, for each personInfo, store the records in ResearchInfoEntity.
I have around 100,000 records to insert from PersonInfoEntity into ResearchInfoEntity. The issue with the code below is that it is taking a lot of time to save the records into the ResearchInfo table: in almost 3 hours it stored only around 2,000 records. Please let me know where the time is being spent, or whether any code optimization is required to insert records in bulk.
Sample code:
List<PersonInfoEntity> personInfoEntityList = personInfoRepository.findAll();
Map<Long, List<PersonInfoEntity>> personInfoEntityMap = personInfoEntityList.stream()
        .collect(Collectors.groupingBy(
                personInfoResponse -> personInfoResponse.getPerson().getPersonId()
        ));
List<ResearchEntity> researchEntityList = researchRepository.findAll();
List<ResearchInfoEntity> researchInfoEntityList = new ArrayList<>();
for (ResearchEntity researchEntity : researchEntityList) {
    List<PersonInfoEntity> personInfoResponseList1 = personInfoEntityMap.get(researchEntity.getPerson().getPersonId());
    if (Objects.nonNull(personInfoResponseList1)) {
        for (PersonInfoEntity personInfoEntity : personInfoResponseList1) {
            ResearchInfoEntity researchInfoEntity = new ResearchInfoEntity();
            researchInfoEntity.setRecovery(researchEntity);
            researchInfoEntity.setMilestoneGroupId(personInfoEntity.getMilestoneGroupId());
            researchInfoEntity.setMilestoneId(personInfoEntity.getMilestoneId());
            researchInfoEntity.setMilestoneStepId(personInfoEntity.getMilestoneStepId());
            researchInfoEntity.setMilestoneStepValue(personInfoEntity.getMilestoneStepValue());
            researchInfoEntity.setCreateBy(personInfoEntity.getCreateBy());
            researchInfoEntity.setCreateTime(personInfoEntity.getCreateTime());
            researchInfoEntity.setUpdateBy(personInfoEntity.getUpdateBy());
            researchInfoEntity.setUpdateTime(personInfoEntity.getUpdateTime());
            researchInfoEntityList.add(researchInfoEntity);
            // researchInfoEntityRepository.save(researchInfoEntity);
        }
        researchInfoEntityRepository.saveAll(researchInfoEntityList);
    }
}

Your mapping of PersonInfoEntity to ResearchInfoEntity could be done asynchronously (a sketch of that variant follows the example below).
You could also try to use parallelStream:
public void yourMethod() {
    List<PersonInfoEntity> personInfoEntityList = personInfoRepository.findAll();
    Map<Long, List<PersonInfoEntity>> personInfoEntityMap = personInfoEntityList.stream()
            .collect(Collectors.groupingBy(
                    personInfoResponse -> personInfoResponse.getPerson().getPersonId()
            ));
    List<ResearchEntity> researchEntityList = researchRepository.findAll();
    List<ResearchInfoEntity> researchInfoEntityList = new ArrayList<>();
    for (ResearchEntity researchEntity : researchEntityList) {
        List<PersonInfoEntity> personInfoResponseList1 = personInfoEntityMap.get(researchEntity.getPerson().getPersonId());
        if (Objects.nonNull(personInfoResponseList1)) {
            List<ResearchInfoEntity> researchInfoListFromPerson = personInfoResponseList1
                    .parallelStream() // <--
                    .map(personInfoEntity -> toResearchInfoEntity(researchEntity, personInfoEntity))
                    .collect(Collectors.toList());
            researchInfoEntityList.addAll(researchInfoListFromPerson);
        }
    }
    researchInfoEntityRepository.saveAll(researchInfoEntityList);
}
private ResearchInfoEntity toResearchInfoEntity(ResearchEntity researchEntity, PersonInfoEntity personInfoEntity) {
    ResearchInfoEntity researchInfoEntity = new ResearchInfoEntity();
    researchInfoEntity.setRecovery(researchEntity);
    researchInfoEntity.setMilestoneGroupId(personInfoEntity.getMilestoneGroupId());
    researchInfoEntity.setMilestoneId(personInfoEntity.getMilestoneId());
    researchInfoEntity.setMilestoneStepId(personInfoEntity.getMilestoneStepId());
    researchInfoEntity.setMilestoneStepValue(personInfoEntity.getMilestoneStepValue());
    researchInfoEntity.setCreateBy(personInfoEntity.getCreateBy());
    researchInfoEntity.setCreateTime(personInfoEntity.getCreateTime());
    researchInfoEntity.setUpdateBy(personInfoEntity.getUpdateBy());
    researchInfoEntity.setUpdateTime(personInfoEntity.getUpdateTime());
    return researchInfoEntity;
}
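For the asynchronous mapping mentioned above, here is a minimal sketch; it reuses the variables and the toResearchInfoEntity helper from the example, and the fixed thread-pool size is an illustrative assumption:
// Map each person's records on a worker thread, then save once at the end.
// Requires java.util.concurrent.CompletableFuture, ExecutorService and Executors.
ExecutorService pool = Executors.newFixedThreadPool(4); // pool size is an assumption
try {
    List<CompletableFuture<List<ResearchInfoEntity>>> futures = new ArrayList<>();
    for (ResearchEntity researchEntity : researchEntityList) {
        List<PersonInfoEntity> personInfos =
                personInfoEntityMap.get(researchEntity.getPerson().getPersonId());
        if (personInfos == null) {
            continue;
        }
        futures.add(CompletableFuture.supplyAsync(
                () -> personInfos.stream()
                        .map(personInfoEntity -> toResearchInfoEntity(researchEntity, personInfoEntity))
                        .collect(Collectors.toList()),
                pool));
    }
    List<ResearchInfoEntity> researchInfoEntityList = futures.stream()
            .map(CompletableFuture::join)   // wait for all mapping tasks to finish
            .flatMap(List::stream)
            .collect(Collectors.toList());
    researchInfoEntityRepository.saveAll(researchInfoEntityList); // single save at the end
} finally {
    pool.shutdown();
}
Only the mapping work moves onto the pool here; the save still happens once on the calling thread, so the transaction handling is unchanged.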
Also, working with 100,000 elements at once takes up a lot of memory. You could try processing your elements in batches.
For example:
public void export(int batchSize) {
    int numberOfElementFetched;
    int pageCount = 0;
    do {
        // entityManager.clear(); // Needed only if you are in a transactional state; you need to clear the entity manager.
        // Otherwise, every iteration will keep the previously fetched elements in memory.
        PageRequest requestByBatch = PageRequest.of(pageCount, batchSize, Sort.by(Sort.Direction.ASC, "id")); // "id" is a placeholder: sort by a stable column such as the primary key
        numberOfElementFetched = yourMethod(requestByBatch);
        pageCount++;
    }
    while (numberOfElementFetched == batchSize);
}
public int yourMethod(PageRequest pageRequest) {
    List<PersonInfoEntity> personInfoEntityList = personInfoRepository.findAll();
    Map<Long, List<PersonInfoEntity>> personInfoEntityMap = personInfoEntityList.stream()
            .collect(Collectors.groupingBy(
                    personInfoResponse -> personInfoResponse.getPerson().getPersonId()
            ));
    List<ResearchEntity> researchEntityList = researchRepository.findAll(pageRequest).getContent();
    List<ResearchInfoEntity> researchInfoEntityList = new ArrayList<>();
    for (ResearchEntity researchEntity : researchEntityList) {
        List<PersonInfoEntity> personInfoResponseList1 = personInfoEntityMap.get(researchEntity.getPerson().getPersonId());
        if (Objects.nonNull(personInfoResponseList1)) {
            for (PersonInfoEntity personInfoEntity : personInfoResponseList1) {
                ResearchInfoEntity researchInfoEntity = new ResearchInfoEntity();
                researchInfoEntity.setRecovery(researchEntity);
                researchInfoEntity.setMilestoneGroupId(personInfoEntity.getMilestoneGroupId());
                researchInfoEntity.setMilestoneId(personInfoEntity.getMilestoneId());
                researchInfoEntity.setMilestoneStepId(personInfoEntity.getMilestoneStepId());
                researchInfoEntity.setMilestoneStepValue(personInfoEntity.getMilestoneStepValue());
                researchInfoEntity.setCreateBy(personInfoEntity.getCreateBy());
                researchInfoEntity.setCreateTime(personInfoEntity.getCreateTime());
                researchInfoEntity.setUpdateBy(personInfoEntity.getUpdateBy());
                researchInfoEntity.setUpdateTime(personInfoEntity.getUpdateTime());
                researchInfoEntityList.add(researchInfoEntity);
            }
        }
    }
    researchInfoEntityRepository.saveAll(researchInfoEntityList);
    return researchEntityList.size();
}
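If saveAll itself stays slow, note that with default settings the provider usually sends one INSERT statement per entity. Assuming Hibernate is the JPA provider (the question does not say), JDBC batching can be enabled via configuration; keep in mind Hibernate cannot batch inserts for entities whose ids use IDENTITY generation, so a sequence- or table-based generator is needed for this to take effect. A minimal sketch:
# application.properties (sketch, assuming Hibernate as the JPA provider)
spring.jpa.properties.hibernate.jdbc.batch_size=50
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true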

Related

Spring GCP - Datastore performance: Batch processing, iteration through all entity list is very slow

The following code works really slowly, taking almost 30 seconds to process 400 entities:
int page = 0;
org.springframework.data.domain.Page<MyEntity> slice = null;
while (true) {
    if (slice == null) {
        slice = repo.findAll(PageRequest.of(page, 400, Sort.by("date")));
    } else {
        slice = repo.findAll(slice.nextPageable());
    }
    if (!slice.hasNext()) {
        break;
    }
    slice.getContent().forEach(v -> v.setApp(SApplication.NAME_XXX));
    repo.saveAll(slice.getContent());
    LOGGER.info("processed: " + page);
    page++;
}
I use the following instead, which takes 4-6 seconds per 400 entities (using the GCP library to work with Datastore directly):
Datastore service = DatastoreOptions.getDefaultInstance().getService();
StructuredQuery.Builder<?> query = Query.newEntityQueryBuilder();
int limit = 400;
query.setKind("ENTITY_KIND").setLimit(limit);
int count = 0;
Cursor cursor = null;
while (true) {
    if (cursor != null) {
        query.setStartCursor(cursor);
    }
    QueryResults<?> queryResult = service.run(query.build());
    List<Entity> entityList = new ArrayList<>();
    while (queryResult.hasNext()) {
        Entity loadEntity = (Entity) queryResult.next();
        Entity.Builder newEntity = Entity.newBuilder(loadEntity).set("app", SApplication.NAME_XXX.name());
        entityList.add(newEntity.build());
    }
    service.put(entityList.toArray(new Entity[0]));
    count += entityList.size();
    if (entityList.size() == limit) {
        cursor = queryResult.getCursorAfter();
    } else {
        break;
    }
    LOGGER.info("Processed: {}", count);
}
Why can't I use Spring to do that batch processing?
Full discussion here: https://github.com/spring-cloud/spring-cloud-gcp/issues/1824
First:
you need to use the correct library version: at least 1.2.0.M2
Second:
you need to implement a new method in the repository interface:
@Query("select * from your_kind")
Slice<TestEntity> findAllSlice(Pageable pageable);
Final code looks like:
LOGGER.info("start");
int page = 0;
Slice<TestEntity> slice = null;
while (true) {
if (slice == null) {
slice = repo.findAllSlice(DatastorePageable.of(page, 400, Sort.by("date")));
} else {
slice = repo.findAllSlice(slice.nextPageable());
}
if (!slice.hasNext()) {
break;
}
slice.getContent().forEach(v -> v.setApp("xx"));
repo.saveAll(slice.getContent());
LOGGER.info("processed: " + page);
page++;
}
LOGGER.info("end");

HBase Aggregation

I'm having some trouble doing aggregation on a particular column in HBase.
This is the snippet of code I tried:
Configuration config = HBaseConfiguration.create();
AggregationClient aggregationClient = new AggregationClient(config);
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("drs"), Bytes.toBytes("count"));
ColumnInterpreter<Long, Long> ci = new LongColumnInterpreter();
Long sum = aggregationClient.sum(Bytes.toBytes("DEMO_CALCULATIONS"), ci , scan);
System.out.println(sum);
sum returns a value of null.
The aggregationClient API works fine if I do a rowcount.
I was trying to follow the directions in http://michaelmorello.blogspot.in/2012/01/row-count-hbase-aggregation-example.html
Could there be a problem with me using a LongColumnInterpreter when the 'count' field is an int? What am I missing here?
With the default setup you can only sum long (8-byte) values, because AggregateImplementation's getSum method handles every returned KeyValue as a long:
List<KeyValue> results = new ArrayList<KeyValue>();
try {
    boolean hasMoreRows = false;
    do {
        hasMoreRows = scanner.next(results);
        for (KeyValue kv : results) {
            temp = ci.getValue(colFamily, qualifier, kv);
            if (temp != null)
                sumVal = ci.add(sumVal, ci.castToReturnType(temp));
        }
        results.clear();
    } while (hasMoreRows);
} finally {
    scanner.close();
}
and in LongColumnInterpreter:
public Long getValue(byte[] colFamily, byte[] colQualifier, KeyValue kv)
        throws IOException {
    if (kv == null || kv.getValueLength() != Bytes.SIZEOF_LONG)
        return null;
    return Bytes.toLong(kv.getBuffer(), kv.getValueOffset());
}
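So the sum comes back null because a 4-byte int value fails the SIZEOF_LONG check above. A minimal sketch of writing the 'count' column as an 8-byte long instead (the row key is illustrative, and the HTable/Put.add API matches the 0.94-era client used in this thread; newer clients use Table and addColumn):
// Store the value as an 8-byte long so LongColumnInterpreter can read and sum it.
HTable table = new HTable(config, "DEMO_CALCULATIONS");
Put put = new Put(Bytes.toBytes("row-1"));                                 // row key is illustrative
put.add(Bytes.toBytes("drs"), Bytes.toBytes("count"), Bytes.toBytes(1L));  // Bytes.toBytes(long) writes 8 bytes
table.put(put);
table.close();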

How can I improve this LINQ query expression's performance?

public bool SaveValidTicketNos(string id, string[] ticketNos, string checkType, string checkMan)
{
    bool result = false;
    List<Carstartlistticket> entities = new List<Carstartlistticket>();
    using (var context = new MiniSysDataContext())
    {
        try
        {
            foreach (var ticketNo in ticketNos)
            {
                Orderticket temp = context.Orderticket.ByTicketNo(ticketNo).SingleOrDefault();
                if (temp != null)
                {
                    Ticketline ticketline = temp.Ticketline;
                    string currencyType = temp.CurrencyType;
                    float personAllowance = GetPersonCountAllowance(context, ticketline, currencyType);
                    Carstartlistticket carstartlistticket = new Carstartlistticket()
                    {
                        CsltId = Guid.NewGuid().ToString(),
                        Carstartlist = new Carstartlist() { CslId = id },
                        LeaveDate = temp.LeaveDate,
                        OnPointName = temp.OnpointName,
                        OffPointName = temp.OffpointName,
                        OutTicketMan = temp.OutBy,
                        TicketNo = temp.TicketNo,
                        ChekMan = checkMan,
                        Type = string.IsNullOrEmpty(checkType) ? (short?)null : Convert.ToInt16(checkType),
                        CreatedOn = DateTime.Now,
                        CreatedBy = checkMan,
                        NumbserAllowance = personAllowance
                    };
                    entities.Add(carstartlistticket);
                }
            }
            context.BeginTransaction();
            context.Carstartlistticket.InsertAllOnSubmit(entities);
            context.SubmitChanges();
            bool changeStateResult = ChangeTicketState(context, ticketNos, checkMan);
            if (changeStateResult)
            {
                context.CommitTransaction();
                result = true;
            }
            else
            {
                context.RollbackTransaction();
            }
        }
        catch (Exception e)
        {
            LogHelper.WriteLog(string.Format("CarstartlistService.SaveValidTicketNos({0},{1},{2},{3})", id, ticketNos, checkType, checkMan), e);
            context.RollbackTransaction();
        }
    }
    return result;
}
My code is above. I suspect this code has terribly poor performance, and the slow point is:
Orderticket temp = context.Orderticket.ByTicketNo(ticketNo).SingleOrDefault();
I get a string array through the method arguments and then want to fetch all the matching data from the database for those ticketNos. Here I use a loop, and I know that written this way the code causes a performance problem, because it accesses the database once per ticket number. How can I avoid this and improve the performance, for example by getting all the data in a single database access?
I forgot to mention the ORM I use: it is PlinqO, based on NHibernate.
I am looking forward to your answers, thank you.
Using plain NHibernate:
var tickets = session.QueryOver<OrderTicket>()
    .WhereRestrictionOn(x => x.TicketNo).IsIn(ticketNos)
    .List();
short? type = null;
short typeValue;
if (!string.IsNullOrEmpty(checkType) && short.TryParse(checkType, out typeValue))
    type = typeValue;
var entitiesToSave = tickets.Select(ticket => new Carstartlistticket
{
    CsltId = Guid.NewGuid().ToString(),
    Carstartlist = new Carstartlist() { CslId = id },
    LeaveDate = ticket.LeaveDate,
    OnPointName = ticket.OnpointName,
    OffPointName = ticket.OffpointName,
    OutTicketMan = ticket.OutBy,
    TicketNo = ticket.TicketNo,
    ChekMan = checkMan,
    CreatedOn = DateTime.Now,
    CreatedBy = checkMan,
    Type = type,
    NumbserAllowance = GetPersonCountAllowance(context, ticket.Ticketline, ticket.CurrencyType)
});
foreach (var entity in entitiesToSave)
{
    session.Save(entity);
}
To enhance this further, try to preload all the needed PersonCountAllowances.

Generics around Entity Framework DbContext cause performance degradation?

I wrote a simple import/export application that transforms data from source->destination using EntityFramework and AutoMapper. It basically:
selects a batch of batchSize records from the source table
'maps' the data from the source entity to the destination entity
adds the new destination entities to the destination table and saves the context
I can move around 500k records in under 5 minutes. After I refactored the code using generics, the performance dropped drastically to 250 records in 5 minutes.
Are my delegates that return DbSet<T> properties on the DbContext causing these problems? Or is something else going on?
Fast non-generic code:
public class Importer
{
    public void ImportAddress()
    {
        const int batchSize = 50;
        int done = 0;
        var src = new SourceDbContext();
        var count = src.Addresses.Count();
        while (done < count)
        {
            using (var dest = new DestinationDbContext())
            {
                var list = src.Addresses.OrderBy(x => x.AddressId).Skip(done).Take(batchSize).ToList();
                list.ForEach(x => dest.Address.Add(Mapper.Map<Addresses, Address>(x)));
                done += batchSize;
                dest.SaveChanges();
            }
        }
        src.Dispose();
    }
}
(Very) slow generic code:
public class Importer<TSourceContext, TDestinationContext>
    where TSourceContext : DbContext
    where TDestinationContext : DbContext
{
    public void Import<TSourceEntity, TSourceOrder, TDestinationEntity>(
        Func<TSourceContext, DbSet<TSourceEntity>> getSourceSet,
        Func<TDestinationContext, DbSet<TDestinationEntity>> getDestinationSet,
        Func<TSourceEntity, TSourceOrder> getOrderBy)
        where TSourceEntity : class
        where TDestinationEntity : class
    {
        const int batchSize = 50;
        int done = 0;
        var ctx = Activator.CreateInstance<TSourceContext>();
        //Does this getSourceSet delegate cause problems perhaps?
        //Added this
        var set = getSourceSet(ctx);
        var count = set.Count();
        while (done < count)
        {
            using (var dctx = Activator.CreateInstance<TDestinationContext>())
            {
                var list = set.OrderBy(getOrderBy).Skip(done).Take(batchSize).ToList();
                //Or is the db-side paging mechanism broken by the getSourceSet delegate?
                //Added this
                var destSet = getDestinationSet(dctx);
                list.ForEach(x => destSet.Add(Mapper.Map<TSourceEntity, TDestinationEntity>(x)));
                done += batchSize;
                dctx.SaveChanges();
            }
        }
        ctx.Dispose();
    }
}
The problem is the invocation of the Func delegates, which you are doing a lot. Cache the resulting values in variables and it will be fine.

NHibernate paging performance

I have a table that contains more than 12 million rows.
I need to index these rows using Lucene.NET (I need to perform the initial indexing).
So I am trying to index in batches, by reading batch packets from SQL (1,000 rows per batch).
Here is how it looks:
public void BuildInitialBookSearchIndex()
{
    FSDirectory directory = null;
    IndexWriter writer = null;
    var type = typeof(Book);
    var info = new DirectoryInfo(GetIndexDirectory());
    //if (info.Exists)
    //{
    //    info.Delete(true);
    //}
    try
    {
        directory = FSDirectory.GetDirectory(Path.Combine(info.FullName, type.Name), true);
        writer = new IndexWriter(directory, new StandardAnalyzer(), true);
    }
    finally
    {
        if (directory != null)
        {
            directory.Close();
        }
        if (writer != null)
        {
            writer.Close();
        }
    }
    var fullTextSession = Search.CreateFullTextSession(Session);
    var currentIndex = 0;
    const int batchSize = 1000;
    while (true)
    {
        var entities = Session
            .CreateCriteria<BookAdditionalInfo>()
            .CreateAlias("Book", "b")
            .SetFirstResult(currentIndex)
            .SetMaxResults(batchSize)
            .List();
        using (var tx = Session.BeginTransaction())
        {
            foreach (var entity in entities)
            {
                fullTextSession.Index(entity);
            }
            currentIndex += batchSize;
            Session.Flush();
            tx.Commit();
            Session.Clear();
        }
        if (entities.Count < batchSize)
            break;
    }
}
But the operation times out when the current index gets bigger than 6-7 million: the NHibernate paging query throws a timeout.
Any suggestions? Is there any other way in NHibernate to index these 12 million rows?
EDIT:
Probably I will implement the most pleasant solution.
Because BookId is the clustered index on my table and selecting by BookId is very fast, I am going to find the max BookId, iterate over all ids up to it, and index every book that exists:
for (long bookId = 0; bookId <= maxBookId; bookId++)
{
    // get the book by bookId
    // if the book exists, index it
}
If you have any other suggestion, please reply to this question.
Instead of paging over your whole data set, you could try to divide and conquer it. You said you had an index on BookId, so just change your criteria to return batches of books according to bounds on BookId:
var entities = Session
    .CreateCriteria<BookAdditionalInfo>()
    .CreateAlias("Book", "b")
    .Add(Restrictions.Gte("BookId", low))
    .Add(Restrictions.Lt("BookId", high))
    .List();
Where low and high step through the id range, e.g. 0-1000, 1000-2000, and so on (Gte makes the lower bound inclusive and Lt makes the upper bound exclusive).
