I have a table that contains more than 12 million rows.
I need to index these rows using Lucene.NET (this is the initial indexing).
So I try to index in batches, reading batches of rows from SQL (1000 rows per batch).
Here is how it looks:
public void BuildInitialBookSearchIndex()
{
FSDirectory directory = null;
IndexWriter writer = null;
var type = typeof(Book);
var info = new DirectoryInfo(GetIndexDirectory());
//if (info.Exists)
//{
// info.Delete(true);
//}
try
{
directory = FSDirectory.GetDirectory(Path.Combine(info.FullName, type.Name), true);
writer = new IndexWriter(directory, new StandardAnalyzer(), true);
}
finally
{
if (directory != null)
{
directory.Close();
}
if (writer != null)
{
writer.Close();
}
}
var fullTextSession = Search.CreateFullTextSession(Session);
var currentIndex = 0;
const int batchSize = 1000;
while (true)
{
var entities = Session
.CreateCriteria<BookAdditionalInfo>()
.CreateAlias("Book", "b")
.SetFirstResult(currentIndex)
.SetMaxResults(batchSize)
.List();
using (var tx = Session.BeginTransaction())
{
foreach (var entity in entities)
{
fullTextSession.Index(entity);
}
currentIndex += batchSize;
Session.Flush();
tx.Commit();
Session.Clear();
}
if (entities.Count < batchSize)
break;
}
}
But the operation times out once the current index gets past 6-7 million: the NHibernate paging query throws a timeout.
Any suggestions? Is there any other way in NHibernate to index these 12 million rows?
EDIT:
I will probably implement the crudest solution.
Because BookId is the clustered index of my table and selecting by BookId is very fast, I am going to find the max BookId, go through all the ids, and index every record that exists.
for (long bookId = 0; bookId <= maxBookId; bookId++)
{
// get book by bookId
// if book exist, index it
}
If you have any other suggestion, please reply to this question.
Instead of paging through your whole data set, you could try to divide and conquer it. You said you have an index on BookId, so just change your criteria to return batches of books bounded by BookId:
var entities = Session
.CreateCriteria<BookAdditionalInfo>()
.CreateAlias("Book", "b")
.Add(Restrictions.Ge("BookId", low))
.Add(Restrictions.Lt("BookId", high))
.List();
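Here low and high advance in steps of the batch size: 0 and 1000, then 1000 and 2000, and so on. The upper bound is exclusive, so consecutive ranges neither overlap nor skip an id.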
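The bounds can be generated up front from the smallest and largest BookId. Below is a minimal sketch, written in Java for illustration (the arithmetic is the same in C#). `IdWindows` and `windows` are hypothetical names, not part of NHibernate, and each window is half-open so that it pairs with a "greater than or equal to low" and "less than high" restriction.

```java
import java.util.ArrayList;
import java.util.List;

class IdWindows {
    // Splits the inclusive id range [minId, maxId] into consecutive
    // half-open windows {low, high} that each cover at most `size` ids.
    static List<long[]> windows(long minId, long maxId, long size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<long[]> result = new ArrayList<>();
        for (long low = minId; low <= maxId; low += size) {
            // Exclusive upper bound, clipped so the last window stops after maxId.
            long high = Math.min(low + size, maxId + 1);
            result.add(new long[] { low, high });
        }
        return result;
    }

    public static void main(String[] args) {
        // Each printed line corresponds to one batch query.
        for (long[] w : windows(0, 2500, 1000)) {
            System.out.println(w[0] + " <= BookId < " + w[1]);
        }
    }
}
```

Because every query is bounded by the indexed key, the cost of a batch should not grow with how far into the table it is, which is the problem with large offsets. Gaps in the id sequence only make some batches smaller.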
I am a Hibernate novice. I have the following code which persists a large number (say 10K) of rows from a List<String>:
@Override
@Transactional(readOnly = false)
public void createParticipantsAccounts(long studyId, List<String> subjectIds) throws Exception {
StudyT study = studyDAO.getStudyByStudyId(studyId);
Authentication auth = SecurityContextHolder.getContext().getAuthentication();
for(String subjectId: subjectIds) { // LOOP with saveAndFlush() for each
// ...
user.setRoleTypeId(4);
user.setActiveFlag("Y");
user.setCreatedBy(auth.getPrincipal().toString().toLowerCase());
user.setCreatedDate(new Date());
List<StudyParticipantsT> participants = new ArrayList<StudyParticipantsT>();
StudyParticipantsT sp = new StudyParticipantsT();
sp.setStudyT(study);
sp.setUsersT(user);
sp.setSubjectId(subjectId);
sp.setLocked("N");
sp.setCreatedBy(auth.getPrincipal().toString().toLowerCase());
sp.setCreatedDate(new Date());
participants.add(sp);
user.setStudyParticipantsTs(participants);
userDAO.saveAndFlush(user);
}
}
}
But this operation takes too long, about 5-10 min. for 10K rows. What is the proper way to improve this? Do I really need to rewrite the whole thing with a Batch Insert, or is there something simple I can tweak?
NOTE I also tried userDAO.save() without the Flush, and userDAO.flush() at the end outside the for-loop. But this didn't help, same bad performance.
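We solved it. Batch inserts are done with saveAll. We define a batch size, say 1000, collect entities in a list, call saveAll on the list once it reaches that size, and then reset it. When we reach the end of the input (an edge case), we also save whatever is left in the list. This dramatically sped up all the inserts.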
int batchSize = 1000;
// List for Batch-Inserts
List<UsersT> batchInsertUsers = new ArrayList<UsersT>();
for(int i = 0; i < subjectIds.size(); i++) {
String subjectId = subjectIds.get(i);
UsersT user = new UsersT();
// Fill out the object here...
// ...
// Add to Batch-Insert List; if list size ready for batch-insert, or if at the end of all subjectIds, do Batch-Insert saveAll() and clear the list
batchInsertUsers.add(user);
if (batchInsertUsers.size() == batchSize || i == subjectIds.size() - 1) {
userDAO.saveAll(batchInsertUsers);
// Reset list
batchInsertUsers.clear();
}
}
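The chunking can also be pulled out of the loop into a small helper. This is a minimal sketch, assuming a hypothetical `saveInBatches` method whose callback stands in for `userDAO.saveAll`; neither the class nor the method name comes from Spring or Hibernate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class BatchSaver {
    // Passes `items` to `saveBatch` in chunks of at most `batchSize`
    // and returns how many times the callback was invoked.
    static <T> int saveInBatches(List<T> items, int batchSize, Consumer<List<T>> saveBatch) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int calls = 0;
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the slice so the callback gets a list of its own.
            saveBatch.accept(new ArrayList<>(items.subList(start, end)));
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            ids.add(i);
        }
        // In the real code the callback would be userDAO::saveAll.
        int calls = saveInBatches(ids, 1000, batch -> System.out.println("saving " + batch.size()));
        System.out.println(calls + " batches");
    }
}
```

The last, shorter chunk falls out of the `Math.min` bound, so no separate end-of-input check is needed. Whether the inserts actually reach the database as JDBC batches also depends on Hibernate configuration such as `hibernate.jdbc.batch_size`, which is worth checking separately.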
In my C# application I am using LINQ. I need help with the syntax for writing an if / else-if in a single line using LINQ. Data and RangeData are the inputs. Here is the code:
var Date1 = RangeData.ToList();
int record =0;
foreach (var tr in Date1)
{
int id =0;
if (tr.Item1 != null && tr.Item1.port != null)
{
id = tr.Item1.port.id;
}
else if (tr.Item2 != null && tr.Item2.port != null)
{
id = tr.Item2.port.id;
}
if (id >0)
{
if(Data.Trygetvalue(id, out cdat))
{
// Do some operation, e.g. var cdata = SumData(id, tr.Item2.port.Date)
record ++;
}
}
}
I think your code example is wrong: your record variable is initialized to 0 on each iteration, so incrementing it is useless.
I suppose that you want to count the records in your list that have an id. You can achieve this with a single Count():
var record = Date1.Count(o => (o.Item1?.port?.id ?? o.Item2?.port?.id) > 0);
You can use the following code:
var count = RangeData.Select(x => new { Id = x.Item1?.port?.id ?? x.Item2?.port?.id ?? 0, Item = x })
.Count(x =>
{
int? cdat = null; // change int to your desired type over here
if (x.Id > 0 && Data.Trygetvalue(x.Id, out cdat))
{
// Do some operation, e.g. var cdata = SumData(x.Id, x.Item.Item2.port.Date)
return true;
}
return false;
});
Edit:
@D Stanley is completely right: LINQ is the wrong tool here. You can still refactor a few bits of your code, though:
var Date1 = RangeData.ToList();
int record =0;
foreach (var tr in Date1)
{
int? cdat = null; // change int to your desired type over here
int id = tr.Item1?.port?.id ?? tr.Item2?.port?.id ?? 0;
if (id >0 && Data.Trygetvalue(id, out cdat))
{
// Do some operation, e.g. var cdata = SumData(id, tr.Item2.port.Date)
record ++;
}
}
Linq is not the right tool here. Linq is for converting or querying a collection. You are looping over a collection and "doing some operation". Depending on what that operation is, shoehorning it into a Linq statement will make the code harder for an outside reader to understand, harder to debug, and harder to maintain.
There is absolutely nothing wrong with the loop that you have. As you can tell from the other answers, it's difficult to wedge all of the information you have into a "single-line" statement just to use Linq.
I'm fairly new to Oracle but I have used the Bulk insert on a couple other applications. Most seem to go faster using it but I've had a couple where it slows down the application. This is my second one where it slowed it down significantly so I'm wondering if I have something set up incorrectly or maybe I need to set it up differently. In this case I have a console application that processes ~1,900 records. Inserting them individually takes ~2.5 hours, and when I switched over to the Bulk insert it jumped to 5 hours.
The article I based this off of is http://www.oracle.com/technetwork/issue-archive/2009/09-sep/o59odpnet-085168.html
Here is what I'm doing: I retrieve some records from the DB, do calculations, and then write the results out to a text file. After the calculations are done I have to write those results back to a different table in the DB so we can look back at those calculations later on if needed.
When I make the calculation I add the results to a List. Once I'm done writing out the file I look at that List and if there are any records I do the bulk insert.
With the bulk insert I have a setting in the App.config to set the number of records I want to insert. In this case I'm using 250 records. I assumed it would be better to limit my in memory arrays to say 250 records versus the 1,900. I loop through that list to the count in the App.config and create an array for each column. Those arrays are then passed as parameters to Oracle.
App.config
<add key="UpdateBatchCount" value="250" />
Class
class EligibleHours
{
public string EmployeeID { get; set; }
public decimal Hours { get; set; }
public string HoursSource { get; set; }
}
Data Manager
public static void SaveEligibleHours(List<EligibleHours> listHours)
{
//set the number of records to update batch on from config file Subtract one because of 0 based index
int batchCount = int.Parse(ConfigurationManager.AppSettings["UpdateBatchCount"]);
//create the arrays to add values to
string[] arrEmployeeId = new string[batchCount];
decimal[] arrHours = new decimal[batchCount];
string[] arrHoursSource = new string[batchCount];
int i = 0;
foreach (var item in listHours)
{
//Store the item first so that no record is skipped when a batch fills up.
arrEmployeeId[i] = item.EmployeeID;
arrHours[i] = item.Hours;
arrHoursSource[i] = item.HoursSource;
i++;
//Once the arrays are full, send the batch and start a new one.
if (i == batchCount)
{
UpdateDbWithEligibleHours(arrEmployeeId, arrHours, arrHoursSource);
//reset counter and array
i = 0;
arrEmployeeId = new string[batchCount];
arrHours = new decimal[batchCount];
arrHoursSource = new string[batchCount];
}
}
//process the last, partially filled batch; trim the arrays so that no empty rows are bound
if (i > 0)
{
Array.Resize(ref arrEmployeeId, i);
Array.Resize(ref arrHours, i);
Array.Resize(ref arrHoursSource, i);
UpdateDbWithEligibleHours(arrEmployeeId, arrHours, arrHoursSource);
}
}
private static void UpdateDbWithEligibleHours(string[] arrEmployeeId, decimal[] arrHours, string[] arrHoursSource)
{
StringBuilder sbQuery = new StringBuilder();
sbQuery.Append("insert into ELIGIBLE_HOURS ");
sbQuery.Append("(EMP_ID, HOURS_SOURCE, TOT_ELIG_HRS, REPORT_DATE) ");
sbQuery.Append("values ");
sbQuery.Append("(:1, :2, :3, SYSDATE) ");
string connectionString = ConfigurationManager.ConnectionStrings["Server_Connection"].ToString();
using (OracleConnection dbConn = new OracleConnection(connectionString))
{
dbConn.Open();
//create Oracle parameters and pass arrays of data
OracleParameter p_employee_id = new OracleParameter();
p_employee_id.OracleDbType = OracleDbType.Char;
p_employee_id.Value = arrEmployeeId;
OracleParameter p_hoursSource = new OracleParameter();
p_hoursSource.OracleDbType = OracleDbType.Char;
p_hoursSource.Value = arrHoursSource;
OracleParameter p_hours = new OracleParameter();
p_hours.OracleDbType = OracleDbType.Decimal;
p_hours.Value = arrHours;
OracleCommand objCmd = dbConn.CreateCommand();
objCmd.CommandText = sbQuery.ToString();
objCmd.ArrayBindCount = arrEmployeeId.Length;
objCmd.Parameters.Add(p_employee_id);
objCmd.Parameters.Add(p_hoursSource);
objCmd.Parameters.Add(p_hours);
objCmd.ExecuteNonQuery();
}
}
I have a big table, and a normal query against it fails with a timeout exception. So I want to select the top 1000 rows and output them, then retrieve rows 1001 to 2000 and log them, and so on.
I am not sure how to add a parameter in my query.
int pageNumber = 0;
var query = DBContext.MyTable.Where(c=>c.FacilityID == facilityID)
.OrderBy(c=>c.FilePath)
.Skip(pageNumber*1000)
.Take(1000);
foreach(var x in query)
{
// Console.WriteLine(x.Name);
}
// I want pageNumber to be incremented until the query reaches the end of the table.
// I don't know how many records are in the table.
Try this out:
int pageNumber = 0;
bool hasHitEnd = false;
while (!hasHitEnd)
{
var page = DBContext.MyTable.Where(c=>c.FacilityID == facilityID)
.OrderBy(c=>c.FilePath)
.Skip(pageNumber*1000)
.Take(1000)
.ToList(); // run the query once and keep the rows
foreach(var x in page)
{
// Do something
}
// A short page means there are no more rows after it.
if (page.Count < 1000)
{
hasHitEnd = true;
}
pageNumber++;
}
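If deep pages become slow, a different technique is keyset (seek) paging: remember the last key of the previous page and ask for rows with a greater key, rather than skipping a growing offset. The sketch below is in Java and simulates the table with a sorted in-memory set. `KeysetPaging`, `nextPage` and `visitAll` are hypothetical names, and the approach assumes the ordering key (FilePath here) is unique.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

class KeysetPaging {
    // Returns up to pageSize keys strictly greater than lastKey
    // (lastKey == null means "start from the beginning").
    static List<String> nextPage(NavigableSet<String> keys, String lastKey, int pageSize) {
        NavigableSet<String> rest = (lastKey == null) ? keys : keys.tailSet(lastKey, false);
        List<String> page = new ArrayList<>();
        for (String key : rest) {
            if (page.size() == pageSize) {
                break;
            }
            page.add(key);
        }
        return page;
    }

    // Walks the whole set page by page and returns how many keys were visited.
    static int visitAll(NavigableSet<String> keys, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int visited = 0;
        String lastKey = null;
        while (true) {
            List<String> page = nextPage(keys, lastKey, pageSize);
            visited += page.size();
            if (page.size() < pageSize) {
                break; // a short page is the last page
            }
            lastKey = page.get(page.size() - 1);
        }
        return visited;
    }

    public static void main(String[] args) {
        NavigableSet<String> keys = new TreeSet<>();
        for (int i = 0; i < 2500; i++) {
            keys.add(String.format("%05d", i));
        }
        System.out.println(visitAll(keys, 1000));
    }
}
```

In a query, the same idea would be a filter on the key being greater than the last one seen, combined with the existing ordering and a Take of the page size.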
I am using SubSonic 3.0.0.3 along with the Linq T4 Templates. My ProjectRepository, for example, has the following two methods:
public int Add(Project item)
{
int result = 0;
ISqlQuery query = BuildInsertQuery(item);
if (query != null)
{
result = query.Execute();
}
return result;
}
private ISqlQuery BuildInsertQuery(Project item)
{
ITable tbl = FindTableByClassName();
Insert query = null;
if (tbl != null)
{
Dictionary<string, object> hashed = item.ToDictionary();
query = new Insert(_db.Provider).Into<Project>(tbl);
foreach (string key in hashed.Keys)
{
IColumn col = tbl.GetColumn(key);
if (col != null)
{
if (!col.AutoIncrement)
{
query.Value(key, hashed[key]);
}
}
}
}
return query;
}
Along with performing the insert (which works great), I'd really like to get the value of the auto-incrementing ProjectId column. For the record, this column is both the primary key and identity column. Is there perhaps a way to append "SELECT SCOPE_IDENTITY();" to the query or maybe there's an entirely different approach which I should try?
You can do this with the ActiveRecord templates, which do all of the wiring above for you (and also have built-in testing). In your scenario, the Add method would have one line, Project.Add(), and it would return the new id.
For your needs, you can try this:
var cmd=query.GetCommand();
cmd.CommandSql+=";SELECT SCOPE_IDENTITY() as newid";
var newID=query.Provider.ExecuteScalar(cmd);
That should work.
Edit: you can also create an extension method for this on ISqlQuery, to save some writing.