Spring Data Neo4j Ridiculously Slow Over REST

public List<Errand> interestFeed(Person person, int skip, int limit)
        throws ControllerException {
    person = validatePerson(person);
    String query = String
            .format("START n=node:ErrandLocation('withinDistance:[%.2f, %.2f, %.2f]') RETURN n ORDER BY n.added DESC SKIP %s LIMIT %s",
                    person.getLongitude(), person.getLatitude(),
                    person.getWidth(), skip, limit);
    String queryFast = String
            .format("START n=node:ErrandLocation('withinDistance:[%.2f, %.2f, %.2f]') RETURN n SKIP %s LIMIT %s",
                    person.getLongitude(), person.getLatitude(),
                    person.getWidth(), skip, limit);
    Set<Errand> errands = new TreeSet<Errand>();
    System.out.println(queryFast);
    Result<Map<String, Object>> results = template.query(queryFast, null);
    Iterator<Errand> objects = results.to(Errand.class).iterator();
    return copyIterator(objects);
}
public List<Errand> copyIterator(Iterator<Errand> iter) {
    Long start = System.currentTimeMillis();
    Double startD = start.doubleValue();
    List<Errand> copy = new ArrayList<Errand>();
    while (iter.hasNext()) {
        Errand e = iter.next();
        copy.add(e);
        System.out.println(e.getType());
    }
    Long end = System.currentTimeMillis();
    Double endD = end.doubleValue();
    System.out.println((endD - startD) / 1000); // elapsed time in seconds
    return copy;
}
When I profile the copyIterator function it takes about 6 seconds to fetch just 10 results. I use Spring Data Neo4j REST to connect to a Neo4j server running on my local machine. I even added a print statement to see how quickly the iterator is converted to a list, and it does appear slow. Does each iterator.next() call make a new HTTP request?

If Errand is a node entity then yes, Spring Data Neo4j will make an HTTP call for each entity to fetch all of its labels (this is a shortcoming of Neo4j, which doesn't return labels when you return a whole node from Cypher).
You can enable debug-level logging on org.springframework.data.neo4j.rest.SpringRestCypherQueryEngine to log all Cypher statements going to Neo4j.
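With Logback, for example, that is a single logger entry (the surrounding logback.xml configuration is assumed to already exist):
<!-- logback.xml: print every Cypher statement SDN sends to the Neo4j REST API -->
<logger name="org.springframework.data.neo4j.rest.SpringRestCypherQueryEngine" level="DEBUG"/>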
To avoid this call, use @QueryResult: http://docs.spring.io/spring-data/data-neo4j/docs/current/reference/html/#reference_programming-model_mapresult
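A minimal sketch of that approach, assuming the SDN 3.x @QueryResult/@ResultColumn annotations on a repository query; the interface name, column aliases, and query parameters below are only illustrative:
@QueryResult
public interface ErrandSummary {
    @ResultColumn("type") String getType();
    @ResultColumn("added") Long getAdded();
}

public interface ErrandRepository extends GraphRepository<Errand> {
    // Returning individual columns instead of whole nodes avoids the per-node label fetch
    @Query("START n=node:ErrandLocation({0}) RETURN n.type AS type, n.added AS added SKIP {1} LIMIT {2}")
    Iterable<ErrandSummary> interestFeed(String withinDistance, int skip, int limit);
}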

Related

Spring GCP Datastore : Batch processing error: Binding site #limit for limit bound to non-integer value parameter., code=INVALID_ARGUMENT

I have 5000 entity records in my GCP Datastore. If I use repo.findAll(), it takes 45 seconds to fetch the results; below is the one-liner:
Iterable<StoreCache> storeCaches = your_kindRepository.findAll();
So I thought of using the pagination feature to fetch 25 records at a time, but I am getting the runtime error below when my code reaches the line repo.findAllSlice(DatastorePageable.of(page, 25)):
com.google.datastore.v1.client.DatastoreException: Binding site #limit for limit bound to non-integer value parameter., code=INVALID_ARGUMENT
This is my repo code:
@Repository
@Transactional
public interface your_kindRepository extends DatastoreRepository<your_kind, Long> {
    @Query("select * from your_kind")
    Slice<TestEntity> findAllSlice(Pageable pageable);
}
This is my service class code:
LOGGER.info("start");
int page = 0;
Slice<TestEntity> slice = null;
while (true) {
    if (slice == null) {
        slice = repo.findAllSlice(DatastorePageable.of(page, 25));
    } else {
        slice = repo.findAllSlice(slice.nextPageable());
    }
    if (!slice.hasNext()) {
        break;
    }
    LOGGER.info("processed: " + page);
    page++;
}
LOGGER.info("end");
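One workaround worth trying (a sketch only, not a verified fix for the INVALID_ARGUMENT error): DatastoreRepository inherits Spring Data's findAll(Pageable), so you can page without a hand-written GQL query and without binding the limit yourself. Entity, repository, and logger names are taken from the question:
// Uses the inherited findAll(Pageable) instead of a custom GQL query
Page<your_kind> current = your_kindRepository.findAll(DatastorePageable.of(0, 25));
while (true) {
    // process the 25 entities of the current page
    current.forEach(entity -> LOGGER.info("processing: " + entity));
    if (!current.hasNext()) {
        break;
    }
    current = your_kindRepository.findAll(current.nextPageable());
}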

Save millions of rows from CSV to Oracle DB Using Spring boot JPA

On a regular basis, another application dumps a CSV that contains more than 7-8 million rows. I have a cron job that loads the data from the CSV and saves it into my Oracle DB. Here's my code snippet:
String line = "";
int count = 0;
LocalDate localDateTime;
Instant from = Instant.now();
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("dd-MMM-yy");
List<ItemizedBill> itemizedBills = new ArrayList<>();
try {
    BufferedReader br = new BufferedReader(new FileReader("/u01/CDR_20210325.csv"));
    while ((line = br.readLine()) != null) {
        if (count >= 1) {
            String[] data = line.split("\\|");
            ItemizedBill customer = new ItemizedBill();
            customer.setEventType(data[0]);
            String date = data[1].substring(0, 2);
            String month = data[1].substring(3, 6);
            String year = data[1].substring(7, 9);
            month = WordUtils.capitalizeFully(month);
            String modifiedDate = date + "-" + month + "-" + year;
            localDateTime = LocalDate.parse(modifiedDate, formatter);
            customer.setEventDate(localDateTime.atStartOfDay(ZoneId.systemDefault()).toInstant());
            customer.setaPartyNumber(data[2]);
            customer.setbPartyNumber(data[3]);
            customer.setVolume(Long.valueOf(data[4]));
            customer.setMode(data[5]);
            if (data[6].contains("0")) { customer.setFnfNum("Other"); }
            else { customer.setFnfNum("FNF Number"); }
            itemizedBills.add(customer);
        }
        count++;
    }
    itemizedBillRepository.saveAll(itemizedBills);
} catch (IOException e) {
    e.printStackTrace();
}
This feature works but takes a lot of time to process. How can I make it more efficient and speed the process up?
There are a couple of things you should do to your code.
String.split, while convenient, can be relatively slow because in general it recompiles the regular expression on each call. Pre-compiling a Pattern once and calling split on that avoids the overhead.
Use proper JPA batching strategies as explained in this blog.
First, enable batch processing in your Spring application.properties. We will use a batch size of 50 (you will need to experiment to find a proper batch size for your case).
spring.jpa.properties.hibernate.jdbc.batch_size=50
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true
Then save each entity directly to the database, and after every 50 items do a flush and clear. This flushes the pending inserts to the database and clears the first-level cache (which prevents excessive dirty checking).
With all the above your code should look something like this.
int count = 0;
String line;
Instant from = Instant.now();
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("dd-MMM-yy");
Pattern splitter = Pattern.compile("\\|");
try {
    BufferedReader br = new BufferedReader(new FileReader("/u01/CDR_20210325.csv"));
    while ((line = br.readLine()) != null) {
        if (count >= 1) {
            String[] data = splitter.split(line);
            ItemizedBill customer = new ItemizedBill();
            customer.setEventType(data[0]);
            String date = data[1].substring(0, 2);
            String month = data[1].substring(3, 6);
            String year = data[1].substring(7, 9);
            month = WordUtils.capitalizeFully(month);
            String modifiedDate = date + "-" + month + "-" + year;
            LocalDate localDate = LocalDate.parse(modifiedDate, formatter);
            customer.setEventDate(localDate.atStartOfDay(ZoneId.systemDefault()).toInstant());
            customer.setaPartyNumber(data[2]);
            customer.setbPartyNumber(data[3]);
            customer.setVolume(Long.valueOf(data[4]));
            customer.setMode(data[5]);
            if (data[6].contains("0")) {
                customer.setFnfNum("Other");
            } else {
                customer.setFnfNum("FNF Number");
            }
            itemizedBillRepository.save(customer);
        }
        count++;
        if ((count % 50) == 0) {
            this.entityManager.flush(); // sync pending inserts with the database
            this.entityManager.clear(); // clear the first-level cache
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}
Two other optimizations you could make:
If your volume property is a long rather than a Long, you should use Long.parseLong(data[4]) instead. It saves the Long creation and the unboxing. With just 10 rows this might not be an issue, but with millions of rows those milliseconds add up.
Parse the raw date field directly instead of rebuilding it. Assuming the field looks like 01-JAN-21, a case-insensitive dd-MMM-yy formatter lets you do LocalDate.parse(data[1], formatter) and drop the substring and capitalizeFully steps, achieving the same result without creating several additional String objects per row.
int count = 0;
String line;
Instant from = Instant.now();
// Case-insensitive formatter for a raw field like "01-JAN-21" (assumed format)
DateTimeFormatter formatter = new DateTimeFormatterBuilder()
        .parseCaseInsensitive()
        .appendPattern("dd-MMM-yy")
        .toFormatter(Locale.ENGLISH);
Pattern splitter = Pattern.compile("\\|");
try {
    BufferedReader br = new BufferedReader(new FileReader("/u01/CDR_20210325.csv"));
    while ((line = br.readLine()) != null) {
        if (count >= 1) {
            String[] data = splitter.split(line);
            ItemizedBill customer = new ItemizedBill();
            customer.setEventType(data[0]);
            LocalDate localDate = LocalDate.parse(data[1], formatter);
            customer.setEventDate(localDate.atStartOfDay(ZoneId.systemDefault()).toInstant());
            customer.setaPartyNumber(data[2]);
            customer.setbPartyNumber(data[3]);
            customer.setVolume(Long.parseLong(data[4]));
            customer.setMode(data[5]);
            if (data[6].contains("0")) {
                customer.setFnfNum("Other");
            } else {
                customer.setFnfNum("FNF Number");
            }
            itemizedBillRepository.save(customer);
        }
        count++;
        if ((count % 50) == 0) {
            this.entityManager.flush(); // sync pending inserts with the database
            this.entityManager.clear(); // clear the first-level cache
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}
You can use Spring Data JPA batch inserts. This link explains how to do it: https://www.baeldung.com/spring-data-jpa-batch-inserts
You can try streaming MySQL results using Java 8 Streams and Spring Data JPA. The link below explains it in detail:
http://knes1.github.io/blog/2015/2015-10-19-streaming-mysql-results-using-java8-streams-and-spring-data.html
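A rough sketch of the pattern from that post, applied to the repository in this question (the method name is made up, the Integer.MIN_VALUE fetch-size hint is a MySQL-specific streaming trick, and the stream must be consumed inside a read-only transaction):
@Query("select i from ItemizedBill i")
@QueryHints(@QueryHint(name = "org.hibernate.fetchSize", value = "" + Integer.MIN_VALUE))
Stream<ItemizedBill> streamAll();

// Usage: try-with-resources so the underlying JDBC resources are released
try (Stream<ItemizedBill> bills = itemizedBillRepository.streamAll()) {
    bills.forEach(bill -> process(bill)); // process(...) stands for your own handling
}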

Hibernate saveAndFlush() takes a long time for 10K By-Row Inserts

I am a Hibernate novice. I have the following code which persists a large number (say 10K) of rows from a List<String>:
@Override
@Transactional(readOnly = false)
public void createParticipantsAccounts(long studyId, List<String> subjectIds) throws Exception {
    StudyT study = studyDAO.getStudyByStudyId(studyId);
    Authentication auth = SecurityContextHolder.getContext().getAuthentication();
    for (String subjectId : subjectIds) { // LOOP with saveAndFlush() for each
        // ...
        user.setRoleTypeId(4);
        user.setActiveFlag("Y");
        user.setCreatedBy(auth.getPrincipal().toString().toLowerCase());
        user.setCreatedDate(new Date());
        List<StudyParticipantsT> participants = new ArrayList<StudyParticipantsT>();
        StudyParticipantsT sp = new StudyParticipantsT();
        sp.setStudyT(study);
        sp.setUsersT(user);
        sp.setSubjectId(subjectId);
        sp.setLocked("N");
        sp.setCreatedBy(auth.getPrincipal().toString().toLowerCase());
        sp.setCreatedDate(new Date());
        participants.add(sp);
        user.setStudyParticipantsTs(participants);
        userDAO.saveAndFlush(user);
    }
}
But this operation takes too long, about 5-10 minutes for 10K rows. What is the proper way to improve this? Do I really need to rewrite the whole thing as a batch insert, or is there something simple I can tweak?
NOTE: I also tried userDAO.save() without the flush, and a single userDAO.flush() at the end outside the for loop. That didn't help; the performance was just as bad.
We solved it. Batch inserts are done with saveAll. We define a batch size, say 1000, saveAll the list, and then reset it. If we are at the end (an edge condition), we also save. This dramatically sped up all the inserts.
int batchSize = 1000;
// List for batch inserts
List<UsersT> batchInsertUsers = new ArrayList<UsersT>();
for (int i = 0; i < subjectIds.size(); i++) {
    String subjectId = subjectIds.get(i);
    UsersT user = new UsersT();
    // Fill out the object here...
    // ...
    // Add to the batch-insert list; when the list reaches the batch size,
    // or we are at the end of all subjectIds, saveAll() and clear the list
    batchInsertUsers.add(user);
    if (batchInsertUsers.size() == batchSize || i == subjectIds.size() - 1) {
        userDAO.saveAll(batchInsertUsers);
        // Reset the list
        batchInsertUsers.clear();
    }
}
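One caveat worth adding (not part of the original answer): saveAll only turns into real JDBC batches if Hibernate batching is enabled, for example with the spring.jpa.properties.hibernate.jdbc.batch_size and order_inserts settings shown in the CSV answer above, and if the entity does not use IDENTITY id generation, because Hibernate disables insert batching for IDENTITY. A sketch of a sequence-based id mapping (the generator and sequence names are assumptions):
// Sequence-based ids keep JDBC insert batching possible (IDENTITY would disable it)
@Id
@GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "users_seq")
@SequenceGenerator(name = "users_seq", sequenceName = "USERS_SEQ", allocationSize = 50)
private Long id;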

Ehcache & multi-threading: how to lock when inserting to the cache?

Let's suppose I have a multi-threaded application with 4 threads which share one (Eh)cache; the cache stores UserProfile objects in order to avoid fetching them from the database every time.
Now, let's say all 4 threads request the same UserProfile with ID=123 at the same moment, and it hasn't been cached yet. What has to be done is to query the database and insert the obtained UserProfile object into the cache so it can be reused later.
However, what I want to achieve is that only one of these threads (the first one) queries the database and updates the cache, while the other 3 wait (queue) for it to finish... and then get the UserProfile object with ID=123 directly from the cache.
How do you usually implement such a scenario? Using Ehcache's locking/transactions? Or rather through something like this? (pseudo-code)
public UserProfile getUserProfile(int id) {
    result = ehcache.get(id)
    if (result == null) { // not cached yet
        synchronized { // queue threads
            result = ehcache.get(id)
            if (result == null) { // is current thread the 1st one?
                result = database.fetchUserProfile(id)
                ehcache.put(id, result)
            }
        }
    }
    return result
}
This is called the thundering herd problem.
Locking works, but it's not really efficient because the lock is broader than what you would like. You could lock on a single ID instead.
You can do two things. One is to use a CacheLoaderWriter. It will load the missing entry and perform the lock at the right granularity. This is the easiest solution, even though you have to implement a loader-writer.
The alternative is more involved. You need some kind of row-locking algorithm. For example, you could do something like this:
private final ReentrantLock[] locks = new ReentrantLock[1024];
{
    for (int i = 0; i < locks.length; i++) {
        locks[i] = new ReentrantLock();
    }
}
public UserProfile getUserProfile(int id) {
    result = ehcache.get(id)
    if (result == null) { // not cached yet
        ReentrantLock lock = locks[id % locks.length];
        lock.lock();
        try {
            result = ehcache.get(id)
            if (result == null) { // is current thread the 1st one?
                result = database.fetchUserProfile(id)
                ehcache.put(id, result)
            }
        } finally {
            lock.unlock();
        }
    }
    return result
}
Use a plain Java object lock:
private static final Object LOCK = new Object();

synchronized (LOCK) {
    result = ehcache.get(id);
    if (result == null || ehcache.isExpired()) {
        // the entry is missing or expired, so go to the DB
        result = database.fetchUserProfile(id);
        ehcache.put(id, result);
    }
}
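If you want per-ID granularity without managing a fixed array of locks, a per-key lock map built from JDK classes is another option. This is only a sketch; ehcache and database stand for the typed cache instance and DAO from the question:
private final ConcurrentHashMap<Integer, Object> keyLocks = new ConcurrentHashMap<>();

public UserProfile getUserProfile(int id) {
    UserProfile result = ehcache.get(id);
    if (result == null) {
        // computeIfAbsent is atomic, so every thread asking for the same ID
        // synchronizes on the same lock object
        Object lock = keyLocks.computeIfAbsent(id, key -> new Object());
        synchronized (lock) {
            result = ehcache.get(id); // re-check once we hold the per-ID lock
            if (result == null) {
                result = database.fetchUserProfile(id);
                ehcache.put(id, result);
            }
        }
    }
    return result;
}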

Using Bulk Insert dramatically slows down processing?

I'm fairly new to Oracle, but I have used bulk inserts in a couple of other applications. Most run faster with them, but I've had a couple where they slowed the application down. This is my second one where it slowed things down significantly, so I'm wondering if I have something set up incorrectly or maybe I need to set it up differently. In this case I have a console application that processes ~1,900 records. Inserting them individually takes ~2.5 hours, and when I switched over to the bulk insert it jumped to 5 hours.
The article I based this off of is http://www.oracle.com/technetwork/issue-archive/2009/09-sep/o59odpnet-085168.html
Here is what I'm doing: I retrieve some records from the DB, do calculations, and then write the results out to a text file. After the calculations are done I have to write those results back to a different table in the DB so we can look back at those calculations later if needed.
When I make the calculations I add the results to a List. Once I'm done writing out the file I look at that List and, if there are any records, I do the bulk insert.
With the bulk insert I have a setting in the App.config for the number of records to insert per batch. In this case I'm using 250 records. I assumed it would be better to limit my in-memory arrays to, say, 250 records rather than the full 1,900. I loop through that list up to the count from the App.config and create an array for each column. Those arrays are then passed as parameters to Oracle.
App.config
<add key="UpdateBatchCount" value="250" />
Class
class EligibleHours
{
    public string EmployeeID { get; set; }
    public decimal Hours { get; set; }
    public string HoursSource { get; set; }
}
Data Manager
public static void SaveEligibleHours(List<EligibleHours> listHours)
{
    //set the number of records to update batch on from config file Subtract one because of 0 based index
    int batchCount = int.Parse(ConfigurationManager.AppSettings["UpdateBatchCount"]);
    //create the arrays to add values to
    string[] arrEmployeeId = new string[batchCount];
    decimal[] arrHours = new decimal[batchCount];
    string[] arrHoursSource = new string[batchCount];
    int i = 0;
    foreach (var item in listHours)
    {
        //Create an array of employee numbers that will be used for a batch update.
        //update after every X amount of records, update. Add 1 to i to compensate for 0 based indexing.
        if (i + 1 <= batchCount)
        {
            arrEmployeeId[i] = item.EmployeeID;
            arrHours[i] = item.Hours;
            arrHoursSource[i] = item.HoursSource;
            i++;
        }
        else
        {
            UpdateDbWithEligibleHours(arrEmployeeId, arrHours, arrHoursSource);
            //reset counter and array
            i = 0;
            arrEmployeeId = new string[batchCount];
            arrHours = new decimal[batchCount];
            arrHoursSource = new string[batchCount];
        }
    }
    //process last array
    if (arrEmployeeId.Length > 0)
    {
        UpdateDbWithEligibleHours(arrEmployeeId, arrHours, arrHoursSource);
    }
}
private static void UpdateDbWithEligibleHours(string[] arrEmployeeId, decimal[] arrHours, string[] arrHoursSource)
{
    StringBuilder sbQuery = new StringBuilder();
    sbQuery.Append("insert into ELIGIBLE_HOURS ");
    sbQuery.Append("(EMP_ID, HOURS_SOURCE, TOT_ELIG_HRS, REPORT_DATE) ");
    sbQuery.Append("values ");
    sbQuery.Append("(:1, :2, :3, SYSDATE) ");
    string connectionString = ConfigurationManager.ConnectionStrings["Server_Connection"].ToString();
    using (OracleConnection dbConn = new OracleConnection(connectionString))
    {
        dbConn.Open();
        //create Oracle parameters and pass arrays of data
        OracleParameter p_employee_id = new OracleParameter();
        p_employee_id.OracleDbType = OracleDbType.Char;
        p_employee_id.Value = arrEmployeeId;
        OracleParameter p_hoursSource = new OracleParameter();
        p_hoursSource.OracleDbType = OracleDbType.Char;
        p_hoursSource.Value = arrHoursSource;
        OracleParameter p_hours = new OracleParameter();
        p_hours.OracleDbType = OracleDbType.Decimal;
        p_hours.Value = arrHours;
        OracleCommand objCmd = dbConn.CreateCommand();
        objCmd.CommandText = sbQuery.ToString();
        objCmd.ArrayBindCount = arrEmployeeId.Length;
        objCmd.Parameters.Add(p_employee_id);
        objCmd.Parameters.Add(p_hoursSource);
        objCmd.Parameters.Add(p_hours);
        objCmd.ExecuteNonQuery();
    }
}
