Using Bulk Insert dramatically slows down processing? - oracle

I'm fairly new to Oracle, but I have used bulk inserts in a couple of other applications. Most run faster with it, but I've had a couple where it slows the application down. This is the second one where it slowed things down significantly, so I'm wondering if I have something set up incorrectly or need to set it up differently. In this case I have a console application that processes ~1,900 records. Inserting them individually takes ~2.5 hours; when I switched over to the bulk insert it jumped to 5 hours.
The article I based this off of is http://www.oracle.com/technetwork/issue-archive/2009/09-sep/o59odpnet-085168.html
Here is what I'm doing: I retrieve some records from the DB, do calculations, and then write the results out to a text file. After the calculations are done I have to write those results back to a different table in the DB so we can look back at those calculations later if needed.
When I make a calculation I add the result to a List. Once I'm done writing out the file I look at that list and, if there are any records, I do the bulk insert.
With the bulk insert I have a setting in App.config for the number of records to insert per batch; in this case I'm using 250. I assumed it would be better to limit my in-memory arrays to, say, 250 records rather than the full 1,900. I loop through that list up to the count from App.config and build an array for each column. Those arrays are then passed as parameters to Oracle.
App.config
<add key="UpdateBatchCount" value="250" />
Class
class EligibleHours
{
    public string EmployeeID { get; set; }
    public decimal Hours { get; set; }
    public string HoursSource { get; set; }
}
Data Manager
public static void SaveEligibleHours(List<EligibleHours> listHours)
{
    //set the number of records to update batch on from config file Subtract one because of 0 based index
    int batchCount = int.Parse(ConfigurationManager.AppSettings["UpdateBatchCount"]);
    //create the arrays to add values to
    string[] arrEmployeeId = new string[batchCount];
    decimal[] arrHours = new decimal[batchCount];
    string[] arrHoursSource = new string[batchCount];
    int i = 0;
    foreach (var item in listHours)
    {
        //Create an array of employee numbers that will be used for a batch update.
        //update after every X amount of records, update. Add 1 to i to compensated for 0 based indexing.
        if (i + 1 <= batchCount)
        {
            arrEmployeeId[i] = item.EmployeeID;
            arrHours[i] = item.Hours;
            arrHoursSource[i] = item.HoursSource;
            i++;
        }
        else
        {
            UpdateDbWithEligibleHours(arrEmployeeId, arrHours, arrHoursSource);
            //reset counter and array
            i = 0;
            arrEmployeeId = new string[batchCount];
            arrHours = new decimal[batchCount];
            arrHoursSource = new string[batchCount];
        }
    }
    //process last array
    if (arrEmployeeId.Length > 0)
    {
        UpdateDbWithEligibleHours(arrEmployeeId, arrHours, arrHoursSource);
    }
}
private static void UpdateDbWithEligibleHours(string[] arrEmployeeId, decimal[] arrHours, string[] arrHoursSource)
{
    StringBuilder sbQuery = new StringBuilder();
    sbQuery.Append("insert into ELIGIBLE_HOURS ");
    sbQuery.Append("(EMP_ID, HOURS_SOURCE, TOT_ELIG_HRS, REPORT_DATE) ");
    sbQuery.Append("values ");
    sbQuery.Append("(:1, :2, :3, SYSDATE) ");
    string connectionString = ConfigurationManager.ConnectionStrings["Server_Connection"].ToString();
    using (OracleConnection dbConn = new OracleConnection(connectionString))
    {
        dbConn.Open();
        //create Oracle parameters and pass arrays of data
        OracleParameter p_employee_id = new OracleParameter();
        p_employee_id.OracleDbType = OracleDbType.Char;
        p_employee_id.Value = arrEmployeeId;
        OracleParameter p_hoursSource = new OracleParameter();
        p_hoursSource.OracleDbType = OracleDbType.Char;
        p_hoursSource.Value = arrHoursSource;
        OracleParameter p_hours = new OracleParameter();
        p_hours.OracleDbType = OracleDbType.Decimal;
        p_hours.Value = arrHours;
        OracleCommand objCmd = dbConn.CreateCommand();
        objCmd.CommandText = sbQuery.ToString();
        objCmd.ArrayBindCount = arrEmployeeId.Length;
        objCmd.Parameters.Add(p_employee_id);
        objCmd.Parameters.Add(p_hoursSource);
        objCmd.Parameters.Add(p_hours);
        objCmd.ExecuteNonQuery();
    }
}
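For reference, below is a minimal sketch of the same array-bound insert with the connection opened once and all batches wrapped in a single local transaction, in case the per-batch connection open in UpdateDbWithEligibleHours is a factor. It reuses the names from the code above; whether it actually changes the timing is an assumption, not something measured here.
//Sketch only: same array binding, but one connection and one local transaction for all batches.
//Assumes ODP.NET (Oracle.ManagedDataAccess.Client or Oracle.DataAccess.Client) and System.Linq.
private static void SaveEligibleHoursSingleConnection(List<EligibleHours> listHours, int batchCount)
{
    const string sql = "insert into ELIGIBLE_HOURS (EMP_ID, HOURS_SOURCE, TOT_ELIG_HRS, REPORT_DATE) values (:1, :2, :3, SYSDATE)";
    string connectionString = ConfigurationManager.ConnectionStrings["Server_Connection"].ToString();
    using (OracleConnection dbConn = new OracleConnection(connectionString))
    {
        dbConn.Open();
        using (OracleTransaction tx = dbConn.BeginTransaction())
        using (OracleCommand cmd = dbConn.CreateCommand())
        {
            cmd.CommandText = sql;
            for (int start = 0; start < listHours.Count; start += batchCount)
            {
                //take the next batch and bind exactly as many rows as it holds
                var batch = listHours.Skip(start).Take(batchCount).ToList();
                cmd.Parameters.Clear();
                cmd.ArrayBindCount = batch.Count;
                cmd.Parameters.Add(new OracleParameter { OracleDbType = OracleDbType.Char, Value = batch.Select(h => h.EmployeeID).ToArray() });
                cmd.Parameters.Add(new OracleParameter { OracleDbType = OracleDbType.Char, Value = batch.Select(h => h.HoursSource).ToArray() });
                cmd.Parameters.Add(new OracleParameter { OracleDbType = OracleDbType.Decimal, Value = batch.Select(h => h.Hours).ToArray() });
                cmd.ExecuteNonQuery();
            }
            //commit once after all batches
            tx.Commit();
        }
    }
}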

Related

Hibernate saveAndFlush() takes a long time for 10K By-Row Inserts

I am a Hibernate novice. I have the following code which persists a large number (say 10K) of rows from a List<String>:
@Override
@Transactional(readOnly = false)
public void createParticipantsAccounts(long studyId, List<String> subjectIds) throws Exception {
    StudyT study = studyDAO.getStudyByStudyId(studyId);
    Authentication auth = SecurityContextHolder.getContext().getAuthentication();
    for (String subjectId : subjectIds) { // LOOP with saveAndFlush() for each
        // ...
        user.setRoleTypeId(4);
        user.setActiveFlag("Y");
        user.setCreatedBy(auth.getPrincipal().toString().toLowerCase());
        user.setCreatedDate(new Date());
        List<StudyParticipantsT> participants = new ArrayList<StudyParticipantsT>();
        StudyParticipantsT sp = new StudyParticipantsT();
        sp.setStudyT(study);
        sp.setUsersT(user);
        sp.setSubjectId(subjectId);
        sp.setLocked("N");
        sp.setCreatedBy(auth.getPrincipal().toString().toLowerCase());
        sp.setCreatedDate(new Date());
        participants.add(sp);
        user.setStudyParticipantsTs(participants);
        userDAO.saveAndFlush(user);
    }
}
But this operation takes too long, about 5-10 min. for 10K rows. What is the proper way to improve this? Do I really need to rewrite the whole thing with a Batch Insert, or is there something simple I can tweak?
NOTE I also tried userDAO.save() without the Flush, and userDAO.flush() at the end outside the for-loop. But this didn't help, same bad performance.
We solved it. Batch inserts are done with saveAll. We define a batch size, say 1000, fill a list, call saveAll, and then clear the list; at the end of the input (the edge condition) we also save whatever is left. This dramatically sped up all the inserts.
int batchSize = 1000;
// List for batch inserts
List<UsersT> batchInsertUsers = new ArrayList<UsersT>();
for (int i = 0; i < subjectIds.size(); i++) {
    String subjectId = subjectIds.get(i);
    UsersT user = new UsersT();
    // Fill out the object here...
    // ...
    // Add to the batch-insert list; if the list has reached the batch size,
    // or we are at the end of all subjectIds, do a batch insert with saveAll() and clear the list
    batchInsertUsers.add(user);
    if (batchInsertUsers.size() == batchSize || i == subjectIds.size() - 1) {
        userDAO.saveAll(batchInsertUsers);
        // Reset list
        batchInsertUsers.clear();
    }
}

LINQ way to change my inefficient program

Please consider the following records
I'm trying to flatten and group the data by Robot Name then by date + Left Factory time then group the address(es) for that date and time. Notice that some of the Left Factory times are identical.
I wrote the code below and it works. It gives me the output that I want. I was a Perl developer so what you see below is from that mentality. I'm sure there is a better way of doing it in LINQ. A little help please.
static void Main(string[] args)
{
    if (args.Length < 1)
    {
        Console.WriteLine("Input file name is required");
        return;
    }
    List<string> rawlst = File.ReadAllLines(args[0]).ToList<string>();
    Dictionary<string, Dictionary<DateTime, List<string>>> dicDriver = new Dictionary<string, Dictionary<DateTime, List<string>>>();
    foreach (string ln in rawlst)
    {
        try
        {
            List<string> parts = ln.Split(',').ToList<string>();
            string[] dtparts = parts[1].Split('/');
            string[] dttime = parts[15].Split(':');
            DateTime dtrow = new DateTime(
                int.Parse(dtparts[2]), int.Parse(dtparts[0]), int.Parse(dtparts[1]),
                int.Parse(dttime[0]), int.Parse(dttime[1]), int.Parse(dttime[2]));
            string rowAddress = parts[7] + " " + parts[9] + " " + parts[10] + " " + parts[11];
            if (!dicDriver.Keys.Contains(parts[3]))
            {
                Dictionary<DateTime, List<string>> thisRec = new Dictionary<DateTime, List<string>>();
                thisRec.Add(dtrow, new List<string>() { rowAddress });
                dicDriver.Add(parts[3], thisRec);
            }
            else
            {
                Dictionary<DateTime, List<string>> thisDriver = dicDriver[parts[3]];
                if (!thisDriver.Keys.Contains(dtrow))
                {
                    dicDriver[parts[3]].Add(dtrow, new List<string>() { rowAddress });
                }
                else
                {
                    dicDriver[parts[3]][dtrow].Add(rowAddress);
                }
            }
        }
        catch (Exception e)
        {
            Console.WriteLine("ERROR:" + ln);
        }
    }
    //output
    string filename = DateTime.Now.Ticks.ToString() + ".out";
    foreach (var name in dicDriver.Keys)
    {
        foreach (var dd in dicDriver[name])
        {
            Console.Write(name + "," + dd.Key + ",");
            File.AppendAllText(filename, name + "," + dd.Key + Environment.NewLine);
            foreach (var addr in dd.Value)
            {
                Console.Write("\t\t" + addr + Environment.NewLine);
                File.AppendAllText(filename, "\t" + addr + Environment.NewLine);
            }
        }
        Console.Write(Environment.NewLine);
        File.AppendAllText(filename, Environment.NewLine);
    }
    Console.ReadLine();
}
You should separate your concerns: separate your input from the processing and from the output.
For example: suppose you had to read your input from a database instead of from a CSV file. Would that seriously change the way you process the fetched data? In your design, fetching the data is mixed with processing it: although you know that the data you want to process contains something like FactoryProcesses, you decide to represent each FactoryProcess as a string. A FactoryProcess is not a string. It describes how, when, and by whom something was processed in your factory. That is not a string, is it? It might be represented internally as a string, but the outside world should not know this. That way, if your FactoryProcesses stop coming from a CSV file and start coming from a database, the users of FactoryProcess won't notice any difference.
Separation of concerns makes your code easier to understand, easier to test, easier to change, and better to re-use.
So let's separate!
IEnumerable<FactoryProcess> ReadFactoryProcesses(string fileName)
{
    // TODO: check fileName not null, file exists
    using (var fileReader = new StreamReader(fileName))
    {
        // read the file line by line and convert each line into one FactoryProcess object
        string line = fileReader.ReadLine();
        while (line != null)
        {
            // one line read, convert to FactoryProcess and yield return:
            FactoryProcess factoryProcess = this.ToFactoryProcess(line);
            yield return factoryProcess;
            // read next line:
            line = fileReader.ReadLine();
        }
    }
}
I'll leave the conversion of a read line into a FactoryProcess up to you. Tip: if the items in your lines are separated by a comma, or something similar, consider using the NuGet package CsvHelper. It makes it easier to convert a file into a sequence of FactoryProcesses.
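As a rough illustration (not part of the original answer), reading the file with CsvHelper could look like the sketch below. It assumes a recent CsvHelper version and a header row whose column names map directly onto the FactoryProcess properties defined further down; for the date + time split described below you would in practice add a ClassMap or a custom converter.
// Sketch only. Requires the CsvHelper NuGet package and
// using System.Collections.Generic; using System.Globalization; using System.IO; using System.Linq; using CsvHelper;
IEnumerable<FactoryProcess> ReadFactoryProcesses(string fileName)
{
    using (var reader = new StreamReader(fileName))
    using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
    {
        // GetRecords is lazy, so materialize the records before the reader is disposed
        return csv.GetRecords<FactoryProcess>().ToList();
    }
}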
I want to group the data by Robot Name then by date + Left Factory time then group the address(es) for that date and time.
First of all: make sure that the FactoryProcess class has the properties you actually need. Separate this representation from what it is in the file. Apparently you want to treat date + left factory time as one item that represents the date and time at which it left the factory. So let's create a DateTime property for this.
class FactoryProcess
{
    public int Id { get; set; }
    public int PartNo { get; set; }
    public string RobotName { get; set; } // or if desired: use a unique RobotId
    ...
    // DateTimes: ArrivalTime, OutOfFactoryTime, LeftFactoryTime
    public DateTime ArrivalTime { get; set; }
    public DateTime OutOfFactoryTime { get; set; }
    public DateTime LeftFactoryTime { get; set; }
}
The reason I put the date and time into one DateTime is that it makes it possible to handle an item that arrives at 23:55 and leaves at 00:05 the next day.
A procedure that converts a read CSV line into a FactoryProcess should interpret your dates and times as strings and convert them into a FactoryProcess. You can create a constructor for this, or a special factory class.
public FactoryProcess InterpretReadLine(string line)
{
    // TODO: separate the parts, such that you've got the strings dateTxt, arrivalTimeTxt, ...
    DateTime date = DateTime.Parse(dateTxt);
    TimeSpan arrivalTime = TimeSpan.Parse(arrivalTimeTxt);
    TimeSpan outOfFactoryTime = TimeSpan.Parse(outOfFactoryTimeTxt);
    TimeSpan leftFactoryTime = TimeSpan.Parse(leftFactoryTimeTxt);
    return new FactoryProcess
    {
        Id = ...,
        PartNo = ...,
        RobotName = ...,
        // The DateTimes:
        ArrivalTime = date + arrivalTime,
        OutOfFactoryTime = date + outOfFactoryTime,
        LeftFactoryTime = date + leftFactoryTime,
    };
}
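One small caveat, as a sketch (this adjustment is not in the original answer): date + leftFactoryTime alone still puts everything on the same calendar date, so the 23:55 / 00:05 case mentioned above needs the departure pushed to the next day. Assuming an item never stays in the factory longer than 24 hours:
var process = new FactoryProcess
{
    // ... same property assignments as above ...
};
// if the departure time-of-day is earlier than the arrival time-of-day,
// the item left after midnight, so move LeftFactoryTime to the next day
if (process.LeftFactoryTime < process.ArrivalTime)
{
    process.LeftFactoryTime = process.LeftFactoryTime.AddDays(1);
}
return process;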
Now that you've created a proper method to convert your CSV-file into a sequence of FactoryProcesses, let's process them
I want to group the data by Robot Name then by date + Left Factory time then group the address(es) for that date and time.
var result = fetchedFactoryProcesses.GroupBy(
    // parameter KeySelector: make groups of FactoryProcesses with the same RobotName:
    factoryProcess => factoryProcess.RobotName,
    // parameter ResultSelector: from every group of FactoryProcesses with this RobotName
    // make one new object:
    (robotName, processesWithThisRobotName) => new
    {
        RobotName = robotName,
        // Group all processes with this RobotName into groups with the same LeftFactoryTime:
        LeftFactory = processesWithThisRobotName.GroupBy(
            // KeySelector: make groups with the same LeftFactoryTime
            process => process.LeftFactoryTime,
            // ResultSelector: from each group of factory processes with the same LeftFactoryTime
            (leftFactoryTime, processesWithThisLeftFactoryTime) => new
            {
                LeftFactoryTime = leftFactoryTime,
                // FactoryProcesses = processesWithThisLeftFactoryTime,
                // or even better: select only the properties you actually plan to use
                FactoryProcesses = processesWithThisLeftFactoryTime.Select(process => new
                {
                    Id = process.Id,
                    PartNo = process.PartNo,
                    ...
                    // not needed: you know the value, because it is in this group
                    // RobotName = process.RobotName,
                    // LeftFactoryTime = process.LeftFactoryTime,
                }),
            })
    });
For completeness: grouping your code together:
void ProcessData(string fileName)
{
    var fetchedFactoryProcesses = ReadFactoryProcesses(fileName); // fetch the data
    var groups = fetchedFactoryProcesses.ToGroups();              // put into groups (the GroupBy shown above)
    this.Display(groups);                                         // output the result
}
Because I separated the input from the conversion of strings to FactoryProcesses, and separated this conversion from the grouping, it will be easy to test the classes separately:
your CSV reader should read any file that is divided into lines and return those lines, even if the file does not contain FactoryProcesses
your conversion from a read line to a FactoryProcess should convert any string that is in the proper format, whether it was read from a file or gathered some other way
your grouping should group any sequence of FactoryProcesses, whether they come from a CSV file, a database, or a List<FactoryProcess>, which is very convenient, because in your tests it is far easier to create a test list than a test CSV file
If in the future you decide to change the source of your sequence of FactoryProcesses, for instance so it comes from a database instead of a CSV file, your grouping won't change. Or if you decide to support entering and leaving the factory on different dates (multiple date values), only the conversion changes. If you decide to display the results in a tree-like fashion, or to write the groups to a database, your reading, conversion, and grouping won't change: a high degree of re-usability!
Separating your concerns made it much easier to understand how to solve your grouping problem, without the hassle of splitting your read lines and converting Date + LeftFactory into one value.

Hextoraw() not working with IN clause while using NamedParameterJdbcTemplate

I am trying to update certain rows in my Oracle DB using an id which is of type RAW(255).
Sample ids: 0BF3957A016E4EBCB68809E6C2EA8B80, 1199B9F29F0A46F486C052669854C2F8...
@Autowired
private NamedParameterJdbcTemplate jdbcTempalte;

private static final String UPDATE_SUB_STATUS = "update SUBSCRIPTIONS set status = :status, modified_date = systimestamp where id in (:ids)";

public void saveSubscriptionsStatus(List<String> ids, String status) {
    MapSqlParameterSource paramSource = new MapSqlParameterSource();
    List<String> idsHexToRaw = new ArrayList<>();
    for (String id : ids) {
        String temp = "hextoraw('" + id + "')";
        idsHexToRaw.add(temp);
    }
    paramSource.addValue("ids", idsHexToRaw);
    paramSource.addValue("status", status);
    jdbcTempalte.update(UPDATE_SUB_STATUS, paramSource);
}
The block of code above executes without any error, but the updates are not reflected in the DB. If I skip hextoraw() and just pass the list of ids, it works fine and updates the data in the table; see the code below.
public void saveSubscriptionsStatus(List<String> ids, String status) {
    MapSqlParameterSource paramSource = new MapSqlParameterSource();
    paramSource.addValue("ids", ids);
    paramSource.addValue("status", status);
    jdbcTempalte.update(UPDATE_SUB_STATUS, paramSource);
}
This code works fine and updates the table, but since I am not using hextoraw() it does a full table scan for the update, which I don't want since I have created indexes. With hextoraw() the index is used, but the values are not updated, which is kind of weird.
I got a solution myself by trying different combinations:
@Autowired
private NamedParameterJdbcTemplate jdbcTempalte;

public void saveSubscriptionsStatus(List<String> ids, String status) {
    String UPDATE_SUB_STATUS = "update SUBSCRIPTIONS set status = :status, modified_date = systimestamp where id in (";
    MapSqlParameterSource paramSource = new MapSqlParameterSource();
    String subQuery = "";
    for (int i = 0; i < ids.size(); i++) {
        String temp = "id" + i;
        paramSource.addValue(temp, ids.get(i));
        subQuery = subQuery + "hextoraw(:" + temp + "), ";
    }
    subQuery = subQuery.substring(0, subQuery.length() - 2);
    UPDATE_SUB_STATUS = UPDATE_SUB_STATUS + subQuery + ")";
    paramSource.addValue("status", status);
    jdbcTempalte.update(UPDATE_SUB_STATUS, paramSource);
}
What this does is build a query with hextoraw() around each id placeholder (id0, id1, id2, ...), add those values to the MapSqlParameterSource instance, and then run the update. This worked fine and it also used the index when updating my table.
After running my new function the query looks like:
update SUBSCRIPTIONS set status = :status, modified_date = systimestamp where id in (hextoraw(:id0), hextoraw(:id1), hextoraw(:id2)...)
The MapSqlParameterSource instance looks like: {("id0", "randomUUID"), ("id1", "randomUUID"), ("id2", "randomUUID").....}
Instead of doing string manipulation, convert the list to a List of byte arrays:
List<byte[]> productGuidByteList = stringList.stream()
        .map(item -> GuidHelper.asBytes(item))
        .collect(Collectors.toList());
parameters.addValue("productGuidSearch", productGuidByteList);

public static byte[] asBytes(UUID uuid) {
    ByteBuffer bb = ByteBuffer.wrap(new byte[16]);
    bb.putLong(uuid.getMostSignificantBits());
    bb.putLong(uuid.getLeastSignificantBits());
    return bb.array();
}

Nhibernate paging performance

I have a table that contains more than 12 million rows.
I need to index these rows using Lucene.NET (this is the initial indexing).
So I index in batches, reading batches from SQL (1,000 rows per batch).
Here is how it looks:
public void BuildInitialBookSearchIndex()
{
    FSDirectory directory = null;
    IndexWriter writer = null;
    var type = typeof(Book);
    var info = new DirectoryInfo(GetIndexDirectory());
    //if (info.Exists)
    //{
    //    info.Delete(true);
    //}
    try
    {
        directory = FSDirectory.GetDirectory(Path.Combine(info.FullName, type.Name), true);
        writer = new IndexWriter(directory, new StandardAnalyzer(), true);
    }
    finally
    {
        if (directory != null)
        {
            directory.Close();
        }
        if (writer != null)
        {
            writer.Close();
        }
    }
    var fullTextSession = Search.CreateFullTextSession(Session);
    var currentIndex = 0;
    const int batchSize = 1000;
    while (true)
    {
        var entities = Session
            .CreateCriteria<BookAdditionalInfo>()
            .CreateAlias("Book", "b")
            .SetFirstResult(currentIndex)
            .SetMaxResults(batchSize)
            .List();
        using (var tx = Session.BeginTransaction())
        {
            foreach (var entity in entities)
            {
                fullTextSession.Index(entity);
            }
            currentIndex += batchSize;
            Session.Flush();
            tx.Commit();
            Session.Clear();
        }
        if (entities.Count < batchSize)
            break;
    }
}
But the operation times out when the current index gets beyond 6-7 million; the NHibernate paging query throws a timeout.
Any suggestions? Is there any other way in NHibernate to index these 12 million rows?
EDIT:
Probably I will implement the most pleasant solution.
Because BookId is the clustered index on my table and selecting by BookId is very fast, I am going to find the max BookId, go through all the ids, and index every book that exists.
for (long bookId = 0; bookId < maxBookId; bookId++)
{
    // get the book by bookId
    // if the book exists, index it
}
If you have any other suggestions, please reply to this question.
Instead of paging over your whole data set, you could try to divide and conquer it. You said you have an index on BookId, so just change your criteria to return batches of books according to bounds on BookId:
var entities = Session
    .CreateCriteria<BookAdditionalInfo>()
    .CreateAlias("Book", "b")
    .Add(Restrictions.Gte("BookId", low))
    .Add(Restrictions.Lt("BookId", high))
    .List();
Where low and high are set like 0-1000, 1000-2000, and so on (a sketch of the outer loop follows below).
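For illustration, a rough sketch of that outer loop; the range size, the Book entity used to find the max BookId, and the Projections.Max call are assumptions on top of the original answer, not part of it:
// assumption: BookId values are roughly dense, so stepping in fixed ranges is reasonable
const int rangeSize = 1000;

// find the highest BookId so the loop knows when to stop
var maxBookId = Session.CreateCriteria<Book>()
    .SetProjection(Projections.Max("BookId"))
    .UniqueResult<long>();

for (long low = 0; low <= maxBookId; low += rangeSize)
{
    long high = low + rangeSize;
    var entities = Session
        .CreateCriteria<BookAdditionalInfo>()
        .CreateAlias("Book", "b")
        .Add(Restrictions.Gte("BookId", low))
        .Add(Restrictions.Lt("BookId", high))
        .List();

    foreach (var entity in entities)
    {
        fullTextSession.Index(entity); // same indexing call as in the question
    }
    Session.Clear(); // keep the session small between ranges
}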

Linq to CSV select by column value

I know I have asked this question in a different manner earlier today, but I have refined my requirements a little.
Given the following CSV file, where the first column is the title and there could be any number of columns:
year,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
income,1000,1500,2000,2100,2100,2100,2100,2100,2100,2100
dividends,100,200,300,300,300,300,300,300,300,300
net profit,1100,1700,2300,2400,2400,2400,2400,2400,2400,2400
expenses,500,600,500,400,400,400,400,400,400,400
profit,600,1100,1800,2000,2000,2000,2000,2000,2000,2000
How do I select the profit value for a given year? So I may provide a year of say 2011 and expect to get the profit value of 2000 back.
At the moment I have this which shows the profit value for each year but ideally I'd like to specify the year and get the profit value;
var data = File.ReadAllLines(fileName)
    .Select(l =>
    {
        var split = l.Split(",".ToCharArray());
        return split;
    });

var profit = (from p in data where p[0] == profitFieldName select p).SingleOrDefault();
var years = (from p in data where p[0] == yearFieldName select p).FirstOrDefault();
int columnCount = years.Count();
for (int t = 1; t < columnCount; t++)
    Console.WriteLine("{0} : ${1}", years[t], profit[t]);
I've already answered this once today, but this answer is a little more fleshed out and hopefully clearer.
string rowName = "profit";
string year = "2011";
var yearRow = data.First();
var yearIndex = Array.IndexOf(yearRow, year);
// get your 'profits' row, or whatever row you want
var row = data.Single(d => d[0] == rowName);
// return the appropriate index for that row.
return row[yearIndex];
This works for me.
You have an unfortunate data format, but I think the best thing to do is just to define a class, create a list, and then use your inputs to create objects to add to the list. Then you can do whatever querying you need to get your desired results.
class MyData
{
    public string Year { get; set; }
    public decimal Income { get; set; }
    public decimal Dividends { get; set; }
    public decimal NetProfit { get; set; }
    public decimal Expenses { get; set; }
    public decimal Profit { get; set; }
}
// ...
string dataFile = @"C:\Temp\data.txt";
List<MyData> list = new List<MyData>();
using (StreamReader reader = new StreamReader(dataFile))
{
    string[] years = reader.ReadLine().Split(',');
    string[] incomes = reader.ReadLine().Split(',');
    string[] dividends = reader.ReadLine().Split(',');
    string[] netProfits = reader.ReadLine().Split(',');
    string[] expenses = reader.ReadLine().Split(',');
    string[] profits = reader.ReadLine().Split(',');
    for (int i = 1; i < years.Length; i++) // index 0 is a title
    {
        MyData myData = new MyData();
        myData.Year = years[i];
        myData.Income = decimal.Parse(incomes[i]);
        myData.Dividends = decimal.Parse(dividends[i]);
        myData.NetProfit = decimal.Parse(netProfits[i]);
        myData.Expenses = decimal.Parse(expenses[i]);
        myData.Profit = decimal.Parse(profits[i]);
        list.Add(myData);
    }
}
// query for whatever data you need
decimal maxProfit = list.Max(data => data.Profit);
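To answer the original question directly, a one-liner on top of that list; the literal "2011" is just an example, and Year is stored as a string here to match the class above:
// profit for a given year, e.g. 2011
decimal profitFor2011 = list.Single(d => d.Year == "2011").Profit;
Console.WriteLine(profitFor2011); // 2000 for the sample data above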
