Hibernate saveAndFlush() takes a long time for 10K By-Row Inserts - spring

I am a Hibernate novice. I have the following code which persists a large number (say 10K) of rows from a List<String>:
@Override
@Transactional(readOnly = false)
public void createParticipantsAccounts(long studyId, List<String> subjectIds) throws Exception {
StudyT study = studyDAO.getStudyByStudyId(studyId);
Authentication auth = SecurityContextHolder.getContext().getAuthentication();
for(String subjectId: subjectIds) { // LOOP with saveAndFlush() for each
// ...
user.setRoleTypeId(4);
user.setActiveFlag("Y");
user.setCreatedBy(auth.getPrincipal().toString().toLowerCase());
user.setCreatedDate(new Date());
List<StudyParticipantsT> participants = new ArrayList<StudyParticipantsT>();
StudyParticipantsT sp = new StudyParticipantsT();
sp.setStudyT(study);
sp.setUsersT(user);
sp.setSubjectId(subjectId);
sp.setLocked("N");
sp.setCreatedBy(auth.getPrincipal().toString().toLowerCase());
sp.setCreatedDate(new Date());
participants.add(sp);
user.setStudyParticipantsTs(participants);
userDAO.saveAndFlush(user);
}
}
}
But this operation takes too long, about 5-10 min. for 10K rows. What is the proper way to improve this? Do I really need to rewrite the whole thing with a Batch Insert, or is there something simple I can tweak?
NOTE I also tried userDAO.save() without the Flush, and userDAO.flush() at the end outside the for-loop. But this didn't help, same bad performance.

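We solved it. Batch inserts are done with saveAll(). We define a batch size, say 1000, collect the entities in a list, call saveAll() on the list whenever it reaches the batch size, and then clear it. At the end of the input (the edge condition) we also save whatever is left in the list. This dramatically sped up all the inserts.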
int batchSize = 1000;
// List for Batch-Inserts
List<UsersT> batchInsertUsers = new ArrayList<UsersT>();
for(int i = 0; i < subjectIds.size(); i++) {
String subjectId = subjectIds.get(i);
UsersT user = new UsersT();
// Fill out the object here...
// ...
// Add to Batch-Insert List; if list size ready for batch-insert, or if at the end of all subjectIds, do Batch-Insert saveAll() and clear the list
batchInsertUsers.add(user);
if (batchInsertUsers.size() == batchSize || i == subjectIds.size() - 1) {
userDAO.saveAll(batchInsertUsers);
// Reset list
batchInsertUsers.clear();
}
}
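The accumulate-and-flush pattern above can be pulled out into a small helper. This is only a sketch: BatchSaver, saveInBatches and the sink parameter are hypothetical names, and the sink stands in for a call such as userDAO.saveAll(batch). Whether each saveAll() call actually turns into JDBC batches also depends on configuration (for example the hibernate.jdbc.batch_size property), which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class BatchSaver {
    // Hands the items to the sink in chunks of at most batchSize,
    // including the final partial chunk. Returns the number of sink calls.
    static <T> int saveInBatches(List<T> items, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int calls = 0;
        List<T> batch = new ArrayList<>(batchSize);
        for (T item : items) {
            batch.add(item);
            if (batch.size() == batchSize) {
                sink.accept(new ArrayList<>(batch)); // copy, so the sink may keep the list
                batch.clear();
                calls++;
            }
        }
        if (!batch.isEmpty()) { // edge condition: leftover items at the end
            sink.accept(new ArrayList<>(batch));
            calls++;
        }
        return calls;
    }
}
```

With 2,500 items and a batch size of 1000 the sink is called three times, with 1000, 1000 and 500 items.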

Related

Ehcache & multi-threading: how to lock when inserting to the cache?

Let's suppose I have a multi-threading application with 4 threads which share one (Eh)cache; the cache stores UserProfile objects in order to avoid fetching them from the database every time.
Now, let's say all these 4 threads request the same UserProfile with ID=123 at the same moment - and it hasn't been cached yet. What has to be done is to query the database and insert obtained UserProfile object into the cache so it could be reused later.
However, what I want to achieve is that only one of these threads (the first one) queries the database and updates the cache, while the other 3 wait (queue) for it to finish... and then get the UserProfile object with ID=123 directly from cache.
How do you usually implement such scenario? Using Ehcache's locking/transactions? Or rather through something like this? (pseudo-code)
public UserProfile getUserProfile(int id) {
result = ehcache.get(id)
if (result == null) { // not cached yet
synchronized { // queue threads
result = ehcache.get(id)
if (result == null) { // is current thread the 1st one?
result = database.fetchUserProfile(id)
ehcache.put(id, result)
}
}
}
return result
}
This is called a Thundering Herd problem.
Locking works, but it's not really efficient because the lock is broader than what you would like. You could lock on a single ID instead.
You can do 2 things. One is to use a CacheLoaderWriter. It will load the missing entry and perform the lock at the right granularity. This is the easiest solution even though you have to implement a loader-writer.
The alternative is more involved. You need some kind of row-locking algorithm. For example, you could do something like this:
private final ReentrantLock[] locks = new ReentrantLock[1024];
{
for (int i = 0; i < locks.length; i++) {
locks[i] = new ReentrantLock();
}
}
public UserProfile getUserProfile(int id) {
result = ehcache.get(id)
if (result == null) { // not cached yet
ReentrantLock lock = locks[Math.floorMod(id, locks.length)]; // floorMod keeps the index non-negative even for negative ids
lock.lock();
try {
result = ehcache.get(id)
if (result == null) { // is current thread the 1st one?
result = database.fetchUserProfile(id)
ehcache.put(id, result)
}
} finally {
lock.unlock();
}
}
return result
}
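For reference, here is a compilable version of the striped-lock idea above. It is a sketch under assumptions: StripedLoader is a hypothetical name, a ConcurrentHashMap stands in for the Ehcache instance, and the loader function stands in for database.fetchUserProfile. The loads counter is there only to make the "one database hit per id" behaviour observable.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.IntFunction;

class StripedLoader<V> {
    private final ReentrantLock[] locks = new ReentrantLock[1024];
    private final Map<Integer, V> cache = new ConcurrentHashMap<>(); // stand-in for Ehcache
    private final IntFunction<V> loader;                             // stand-in for the DB fetch
    final AtomicInteger loads = new AtomicInteger();                 // counts loader invocations

    StripedLoader(IntFunction<V> loader) {
        this.loader = loader;
        for (int i = 0; i < locks.length; i++) {
            locks[i] = new ReentrantLock();
        }
    }

    V get(int id) {
        V result = cache.get(id);
        if (result == null) { // not cached yet
            // Ids that map to the same stripe share a lock; other ids proceed in parallel.
            ReentrantLock lock = locks[Math.floorMod(id, locks.length)];
            lock.lock();
            try {
                result = cache.get(id); // re-check: another thread may have loaded it
                if (result == null) {
                    loads.incrementAndGet();
                    result = loader.apply(id); // assumed to never return null
                    cache.put(id, result);
                }
            } finally {
                lock.unlock();
            }
        }
        return result;
    }
}
```

Four threads calling get(123) at the same time on a cold cache should trigger a single loader call; the other three block on the stripe's lock and then read the cached value.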
Use a plain Java object lock:
private static final Object LOCK = new Object();
synchronized (LOCK) {
result = ehcache.get(id);
if ( result == null || ehcache.isExpired() ) {
// cache is expired or null so going to DB
result = database.fetchUserProfile(id);
ehcache.put(id, result)
}
}

Using Bulk Insert dramatically slows down processing?

I'm fairly new to Oracle, but I have used bulk insert in a couple of other applications. Most seem to go faster with it, but I've had a couple where it slows the application down. This is the second one where it slowed things down significantly, so I'm wondering if I have something set up incorrectly or need to set it up differently. In this case I have a console application that processes ~1,900 records. Inserting them individually takes ~2.5 hours; when I switched over to bulk insert it jumped to 5 hours.
The article I based this off of is http://www.oracle.com/technetwork/issue-archive/2009/09-sep/o59odpnet-085168.html
Here is what I'm doing: I retrieve some records from the DB, do calculations, and then write the results out to a text file. After the calculations are done I have to write those results back to a different table in the DB so we can look back at those calculations later on if needed.
When I make the calculation I add the results to a List. Once I'm done writing out the file I look at that List and if there are any records I do the bulk insert.
With the bulk insert I have a setting in the App.config to set the number of records I want to insert. In this case I'm using 250 records. I assumed it would be better to limit my in memory arrays to say 250 records versus the 1,900. I loop through that list to the count in the App.config and create an array for each column. Those arrays are then passed as parameters to Oracle.
App.config
<add key="UpdateBatchCount" value="250" />
Class
class EligibleHours
{
public string EmployeeID { get; set; }
public decimal Hours { get; set; }
public string HoursSource { get; set; }
}
Data Manager
public static void SaveEligibleHours(List<EligibleHours> listHours)
{
//read the number of records per batch from the config file
int batchCount = int.Parse(ConfigurationManager.AppSettings["UpdateBatchCount"]);
//create the arrays to add values to
string[] arrEmployeeId = new string[batchCount];
decimal[] arrHours = new decimal[batchCount];
string[] arrHoursSource = new string[batchCount];
int i = 0;
foreach (var item in listHours)
{
//Fill the column arrays that will be used for a batch insert.
//Flush after every batchCount records. Add 1 to i to compensate for 0-based indexing.
if (i + 1 <= batchCount)
{
arrEmployeeId[i] = item.EmployeeID;
arrHours[i] = item.Hours;
arrHoursSource[i] = item.HoursSource;
i++;
}
else
{
UpdateDbWithEligibleHours(arrEmployeeId, arrHours, arrHoursSource);
//reset counter and array
i = 0;
arrEmployeeId = new string[batchCount];
arrHours = new decimal[batchCount];
arrHoursSource = new string[batchCount];
}
}
//process last array
if (arrEmployeeId.Length > 0)
{
UpdateDbWithEligibleHours(arrEmployeeId, arrHours, arrHoursSource);
}
}
private static void UpdateDbWithEligibleHours(string[] arrEmployeeId, decimal[] arrHours, string[] arrHoursSource)
{
StringBuilder sbQuery = new StringBuilder();
sbQuery.Append("insert into ELIGIBLE_HOURS ");
sbQuery.Append("(EMP_ID, HOURS_SOURCE, TOT_ELIG_HRS, REPORT_DATE) ");
sbQuery.Append("values ");
sbQuery.Append("(:1, :2, :3, SYSDATE) ");
string connectionString = ConfigurationManager.ConnectionStrings["Server_Connection"].ToString();
using (OracleConnection dbConn = new OracleConnection(connectionString))
{
dbConn.Open();
//create Oracle parameters and pass arrays of data
OracleParameter p_employee_id = new OracleParameter();
p_employee_id.OracleDbType = OracleDbType.Char;
p_employee_id.Value = arrEmployeeId;
OracleParameter p_hoursSource = new OracleParameter();
p_hoursSource.OracleDbType = OracleDbType.Char;
p_hoursSource.Value = arrHoursSource;
OracleParameter p_hours = new OracleParameter();
p_hours.OracleDbType = OracleDbType.Decimal;
p_hours.Value = arrHours;
OracleCommand objCmd = dbConn.CreateCommand();
objCmd.CommandText = sbQuery.ToString();
objCmd.ArrayBindCount = arrEmployeeId.Length;
objCmd.Parameters.Add(p_employee_id);
objCmd.Parameters.Add(p_hoursSource);
objCmd.Parameters.Add(p_hours);
objCmd.ExecuteNonQuery();
}
}
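One thing worth checking in the batching loop above: as written, the record that arrives when the arrays are already full appears to be skipped (the else branch flushes but never stores the current item), and the final call passes full-length arrays regardless of how many slots were filled. A sketch of the slicing logic that avoids both, written in Java rather than C# and with hypothetical names throughout (ColumnBatcher, ColumnBatch, toBatches), could look like this. It only builds the per-column arrays; the actual array-bound insert is not shown.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

class ColumnBatcher {
    // Hypothetical Java counterpart of the EligibleHours class above.
    static final class EligibleHours {
        final String employeeId;
        final BigDecimal hours;
        final String hoursSource;

        EligibleHours(String employeeId, BigDecimal hours, String hoursSource) {
            this.employeeId = employeeId;
            this.hours = hours;
            this.hoursSource = hoursSource;
        }
    }

    // One batch of column arrays, each exactly as long as the number of rows in the batch.
    static final class ColumnBatch {
        final String[] employeeIds;
        final BigDecimal[] hours;
        final String[] hoursSources;

        ColumnBatch(List<EligibleHours> rows) {
            int n = rows.size();
            employeeIds = new String[n];
            hours = new BigDecimal[n];
            hoursSources = new String[n];
            for (int i = 0; i < n; i++) {
                EligibleHours row = rows.get(i);
                employeeIds[i] = row.employeeId;
                hours[i] = row.hours;
                hoursSources[i] = row.hoursSource;
            }
        }
    }

    // Slices the rows into consecutive batches of at most batchCount rows; no row is skipped.
    static List<ColumnBatch> toBatches(List<EligibleHours> rows, int batchCount) {
        if (batchCount <= 0) {
            throw new IllegalArgumentException("batchCount must be positive");
        }
        List<ColumnBatch> batches = new ArrayList<>();
        for (int from = 0; from < rows.size(); from += batchCount) {
            int to = Math.min(from + batchCount, rows.size());
            batches.add(new ColumnBatch(rows.subList(from, to)));
        }
        return batches;
    }
}
```

For 1,900 rows and a batch count of 250 this yields eight batches, the last one holding 150 rows, so the bind count for each call would match the number of real rows.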

Spring Data Neo4j Ridiculously Slow Over Rest

public List<Errand> interestFeed(Person person, int skip, int limit)
throws ControllerException {
person = validatePerson(person);
String query = String
.format("START n=node:ErrandLocation('withinDistance:[%.2f, %.2f, %.2f]') RETURN n ORDER BY n.added DESC SKIP %s LIMIT %S",
person.getLongitude(), person.getLatitude(),
person.getWidth(), skip, limit);
String queryFast = String
.format("START n=node:ErrandLocation('withinDistance:[%.2f, %.2f, %.2f]') RETURN n SKIP %s LIMIT %S",
person.getLongitude(), person.getLatitude(),
person.getWidth(), skip, limit);
Set<Errand> errands = new TreeSet<Errand>();
System.out.println(queryFast);
Result<Map<String, Object>> results = template.query(queryFast, null);
Iterator<Errand> objects = results.to(Errand.class).iterator();
return copyIterator (objects);
}
public List<Errand> copyIterator(Iterator<Errand> iter) {
Long start = System.currentTimeMillis();
Double startD = start.doubleValue();
List<Errand> copy = new ArrayList<Errand>();
while (iter.hasNext()) {
Errand e = iter.next();
copy.add(e);
System.out.println(e.getType());
}
Long end = System.currentTimeMillis();
Double endD = end.doubleValue();
p ((endD - startD)/1000);
return copy;
}
When I profile the copyIterator function it takes about 6 seconds to fetch just 10 results. I use Spring Data Neo4j Rest to connect with a Neo4j server running on my local machine. I even put a print function to see how fast the iterator is converted to a list and it does appear slow. Does each iterator.next() make a new Http call?
If Errand is a node entity, then yes: spring-data-neo4j will make an HTTP call for each entity to fetch all its labels (this is a limitation of Neo4j, which doesn't return labels when you return a whole node in Cypher).
You can enable debug-level logging for org.springframework.data.neo4j.rest.SpringRestCypherQueryEngine to log all Cypher statements going to Neo4j.
To avoid this call, use @QueryResult: http://docs.spring.io/spring-data/data-neo4j/docs/current/reference/html/#reference_programming-model_mapresult

hbase InternalScanner and filter in coprocessor

All:
Recently I wrote a coprocessor in HBase (0.94.17): a class that extends BaseEndpointCoprocessor, with a rowCount method that counts the rows of one table.
And I ran into a problem.
If I do not set a filter on the scan, my code works fine for two tables. One table has 1,000,000 rows, the other has 160,000,000 rows. It took about 2 minutes to count the bigger table.
However, if I set a filter on the scan, it only works on the small table. It throws an exception on the bigger table:
org.apache.hadoop.hbase.ipc.ExecRPCInvoker$1@2c88652b, java.io.IOException: java.io.IOException: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
Trust me, I checked my code over and over again.
So, to count my table with a filter, I had to write the following clumsy code: first, I do not set a filter on the scan, and then, after I get each row, I call my own method to filter it.
This works on both tables.
But I do not know why.
I tried to read the scanner source code in HRegion.java, but I did not understand it.
So if you know the answer, please help me. Thank you.
@Override
public long rowCount(Configuration conf) throws IOException {
// TODO Auto-generated method stub
Scan scan = new Scan();
parseConfiguration(conf);
Filter filter = null;
if (this.mFilterString != null && !mFilterString.equals("")) {
ParseFilter parse = new ParseFilter();
filter = parse.parseFilterString(mFilterString);
// scan.setFilter(filter);
}
scan.setCaching(this.mScanCaching);
InternalScanner scanner = ((RegionCoprocessorEnvironment) getEnvironment()).getRegion().getScanner(scan);
long sum = 0;
try {
List<KeyValue> curVals = new ArrayList<KeyValue>();
boolean hasMore = false;
do {
curVals.clear();
hasMore = scanner.next(curVals);
if (filter != null) {
filter.reset();
if (HbaseUtil.filterOneResult(curVals, filter)) {
continue;
}
}
sum++;
} while (hasMore);
} finally {
scanner.close();
}
return sum;
}
The following is my hbase util code:
public static boolean filterOneResult(List<KeyValue> kvList, Filter filter) {
if (kvList.size() == 0)
return true;
KeyValue kv = kvList.get(0);
if (filter.filterRowKey(kv.getBuffer(), kv.getRowOffset(), kv.getRowLength())) {
return true;
}
for (KeyValue kv2 : kvList) {
if (filter.filterKeyValue(kv2) == Filter.ReturnCode.NEXT_ROW) {
return true;
}
}
filter.filterRow(kvList);
if (filter.filterRow())
return true;
else
return false;
}
OK, it was my mistake. After I used jdb to debug my code, I got the following exception:
"org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:635)
at java.util.ArrayList.get(ArrayList.java:411)
Obviously, my result list is empty.
hasMore = scanner.next(curVals);
This means that if I use a filter on the scan, the curVals list may be empty even though hasMore is true.
I had assumed that if a record was filtered out, the scanner would jump to the next row and this list would never be empty. I was wrong.
My client also did not print any remote error message on the console; it just caught the remote exception and retried.
After retrying 10 times, it printed another exception, which was meaningless.
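To make the finding concrete, here is a small self-contained sketch of a counting loop that tolerates an empty result list while hasMore is still true. RowScanner is a hypothetical stand-in for the scanner, not the HBase InternalScanner API, and the cells are plain strings. An empty list in the replayed input models a call where the filter left nothing to return.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class EmptyBatchCounter {
    // Hypothetical stand-in for the scanner: next() fills `out` with the cells of
    // one row (possibly none) and returns whether another call is worthwhile.
    interface RowScanner {
        boolean next(List<String> out);
    }

    // Counts only the calls that actually produced cells.
    static long countRows(RowScanner scanner) {
        long sum = 0;
        List<String> curVals = new ArrayList<>();
        boolean hasMore;
        do {
            curVals.clear();
            hasMore = scanner.next(curVals);
            if (!curVals.isEmpty()) { // an empty list with hasMore == true is not a row
                sum++;
            }
        } while (hasMore);
        return sum;
    }

    // Builds a scanner that replays the given rows in order.
    static RowScanner replay(List<List<String>> rows) {
        Iterator<List<String>> it = rows.iterator();
        return out -> {
            if (it.hasNext()) {
                out.addAll(it.next());
            }
            return it.hasNext();
        };
    }
}
```

The important part is the isEmpty() check before touching curVals; indexing into the list unconditionally is what produces the IndexOutOfBoundsException: Index: 0, Size: 0 shown above.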

Nhibernate paging performance

I have a table that contains more than 12 million rows.
I need to index these rows using Lucene.NET (I need to perform the initial indexing).
So I try to index in batches, reading each batch from SQL (1000 rows per batch).
Here is how it looks:
public void BuildInitialBookSearchIndex()
{
FSDirectory directory = null;
IndexWriter writer = null;
var type = typeof(Book);
var info = new DirectoryInfo(GetIndexDirectory());
//if (info.Exists)
//{
// info.Delete(true);
//}
try
{
directory = FSDirectory.GetDirectory(Path.Combine(info.FullName, type.Name), true);
writer = new IndexWriter(directory, new StandardAnalyzer(), true);
}
finally
{
if (directory != null)
{
directory.Close();
}
if (writer != null)
{
writer.Close();
}
}
var fullTextSession = Search.CreateFullTextSession(Session);
var currentIndex = 0;
const int batchSize = 1000;
while (true)
{
var entities = Session
.CreateCriteria<BookAdditionalInfo>()
.CreateAlias("Book", "b")
.SetFirstResult(currentIndex)
.SetMaxResults(batchSize)
.List();
using (var tx = Session.BeginTransaction())
{
foreach (var entity in entities)
{
fullTextSession.Index(entity);
}
currentIndex += batchSize;
Session.Flush();
tx.Commit();
Session.Clear();
}
if (entities.Count < batchSize)
break;
}
}
But the operation times out when the current index is bigger than 6-7 million. NHibernate paging throws a timeout.
Any suggestions? Is there any other way in NHibernate to index these 12 million rows?
EDIT:
Probably I will implement the crudest solution.
Because BookId is the clustered index in my table and selecting by BookId is very fast, I am going to find the max BookId, go through all ids up to it, and index every book that exists.
for (long bookId = 0; bookId <= maxBookId; bookId++)
{
// get book by bookId
// if book exist, index it
}
If you have any other suggestions, please reply to this question.
Instead of paging through your whole data set, you could try to divide and conquer it. You said you had an index on book id, so just change your criteria to return batches of books according to bounds on BookId:
var entities = Session
.CreateCriteria<BookAdditionalInfo>()
.CreateAlias("Book", "b")
.Add(Restrictions.Gte("BookId", low))
.Add(Restrictions.Lt("BookId", high))
.List();
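Where low and high are set like 0-1000, 1000-2000, etc. (low is inclusive and high is exclusive, so consecutive ranges share a boundary and no id is skipped).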
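The bounds themselves are easy to generate up front. The sketch below (hypothetical names IdRanges and ranges, written in Java) produces consecutive half-open ranges [low, high) that cover every id from minId to maxId exactly once. The likely benefit over SetFirstResult paging is that each query can seek directly to its range through the index instead of stepping over all the rows before the offset, though that depends on the database and the query plan.

```java
import java.util.ArrayList;
import java.util.List;

class IdRanges {
    // Returns consecutive half-open ranges {low, high} covering minId..maxId inclusive.
    // Each pair is meant for a query of the form: BookId >= low AND BookId < high.
    static List<long[]> ranges(long minId, long maxId, long step) {
        if (step <= 0) {
            throw new IllegalArgumentException("step must be positive");
        }
        List<long[]> out = new ArrayList<>();
        for (long low = minId; low <= maxId; low += step) {
            long high = Math.min(low + step, maxId + 1); // exclusive upper bound
            out.add(new long[] {low, high});
        }
        return out;
    }
}
```

For ids 0 to 2500 with a step of 1000 this gives [0, 1000), [1000, 2000) and [2000, 2501). Ranges in which no book exists simply return an empty list and can be skipped.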

Resources