I am reading a big text file using Java. The file has 5,000,000 rows, and each row has 3 columns. The file size is 350 MB.
For each row, I read it, create an object using Criteria (the project uses Maven), and store it in a PostgreSQL database with a session.saveOrUpdate(object) call.
In the database I have a table with a serial ID and three attributes, where I store the three columns of the file.
At the beginning the process runs "fast" (35,000 records in 30 minutes), but it keeps getting slower and the time to finish grows exponentially. How can I improve the process?
I have tried splitting the big file into several smaller files, but that is almost as slow.
Many thanks in advance!
PS: the code:
public void process() {
    File archivo = null;
    FileReader fr = null;
    BufferedReader br = null;
    String linea;
    String[] columna;
    try {
        archivo = new File("/home/josealopez/Escritorio/file.txt");
        fr = new FileReader(archivo);
        br = new BufferedReader(fr);
        while ((linea = br.readLine()) != null) {
            columna = linea.split(";");
            saveIntoBBDD(columna[0], columna[1], columna[2]);
        }
    }
    catch (Exception e) {
        e.printStackTrace();
    }
    finally {
        try {
            if (null != fr) {
                fr.close();
            }
        }
        catch (Exception e2) {
            e2.printStackTrace();
        }
    }
}

@CommitAfter
public void saveIntoBBDD(String lon, String lat, String met) {
    Object b = new Object(); // "Object" stands in for the actual mapped entity class, elided in the original
    b.setLon(Double.parseDouble(lon));
    b.setLat(Double.parseDouble(lat));
    b.setMeters(Double.parseDouble(met));
    session.saveOrUpdate(b);
}
You should focus on running this as a bulk process; line-by-line processing is your issue here. PostgreSQL has a built-in command for bulk file loading, named COPY, which can deal with comma-separated and tab-separated files. Of course, the delimiter, quote characters, and many other settings are customizable.
Please check the official PostgreSQL documentation on populating a database, and also the details of the COPY command.
In this answer I provided a small example of how I do similar kinds of things.
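If you want to stay in Java, the JDBC driver exposes the same COPY machinery through its CopyManager API. Below is a minimal sketch, assuming the PostgreSQL JDBC driver is on the classpath; the table measurements and its columns lon, lat, meters are hypothetical stand-ins for your actual schema, and the connection settings are placeholders:

import java.io.FileReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.DriverManager;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class BulkLoad {
    public static void main(String[] args) throws Exception {
        // Placeholder connection settings; adjust to your environment
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {
            // The driver exposes the COPY protocol via CopyManager
            CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
            try (Reader reader = new FileReader("/home/josealopez/Escritorio/file.txt")) {
                // Streams the whole file in one COPY operation instead of one INSERT per row
                long rows = copy.copyIn(
                    "COPY measurements (lon, lat, meters) FROM STDIN WITH (FORMAT csv, DELIMITER ';')",
                    reader);
                System.out.println("Loaded " + rows + " rows");
            }
        }
    }
}

Since the file is plain semicolon-delimited text, no per-row entity creation or session cache is involved, which is what makes this kind of load fast.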
Related
I have an application built on Windows Forms. With this application I am reading a text file of tens of thousands of lines. I process the individual rows in the file and save them to the database. The first lines process very quickly, but the processing time increases as the process continues; it looks like memory usage grows as well. I don't know what I'm missing. Waiting for your suggestions.
This is the code I use:
if (ofData.ShowDialog() == DialogResult.OK)
{
    //ofData => OpenFileDialog name
    string[] lines = File.ReadAllLines(ofData.FileName, Encoding.UTF8);
    foreach (string line in lines)
    {
        string[] res = line.Split('#');
        string opaqId = res[0].Trim();
        string name = res[1].Trim();
        string surName = res[2].Trim();

        Student s = new Student();
        s.opaq = opaqId;
        s.name = name;
        s.surname = surName;
        studentManager.Insert(s); //EntityFramework insert into database
    }
}
ofData.Dispose();
I need, for each pair of words, the number of lines that contain both words. For this purpose I have written the following code:
The input file contains 1,000 lines and about 4,000 words, and it takes about 4 hours.
Is there a library in Java that can do this faster?
Can I implement this code using Apache Lucene or Stanford CoreNLP to achieve a shorter run time?
ArrayList<String> reviews = new ArrayList<String>();
ArrayList<String> terms = new ArrayList<String>();
Map<String, Double> pij = new HashMap<String, Double>();

BufferedReader br = null;
FileReader fr = null;
try
{
    fr = new FileReader("src/reviews-preprocessing.txt");
    br = new BufferedReader(fr);
    String line;
    while ((line = br.readLine()) != null)
    {
        for (String term : line.split(" "))
        {
            if (!terms.contains(term))
                terms.add(term);
        }
        reviews.add(line);
    }
}
catch (IOException e) { e.printStackTrace(); }
finally
{
    try
    {
        if (br != null)
            br.close();
        if (fr != null)
            fr.close();
    }
    catch (IOException ex) { ex.printStackTrace(); }
}

long Count = reviews.size();
for (String term_i : terms)
{
    for (String term_j : terms)
    {
        if (!term_i.equals(term_j))
        {
            double p = (double) reviews.parallelStream()
                .filter(s -> s.contains(term_i) && s.contains(term_j)).count();
            String key = String.format("%s_%s", term_i, term_j);
            pij.put(key, p / Count);
        }
    }
}
Your first loop, which collects the distinct words, relies on ArrayList.contains, which has linear time complexity, instead of using a Set. So if we assume nd distinct words, it already has a time complexity of “number of lines”×nd.
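For illustration, here is a minimal sketch of that first loop rewritten with a Set (assuming the usual java.util imports; br and reviews are the same variables as in the question):

Set<String> terms = new LinkedHashSet<>(); // contains() is O(1) here; keeps first-seen order
String line;
while ((line = br.readLine()) != null) {
    // a Set silently discards duplicates, so no explicit contains() check is needed
    Collections.addAll(terms, line.split(" "));
    reviews.add(line);
}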
Then, you are creating nd×nd word combinations and probing all 1,000 lines for the presence of each combination. In other words, if we assume only 100 distinct words, you are performing 1,000×100 + 100×100×1,000 = 10,100,000 operations; if we assume 500 distinct words, we’re talking about 250,500,000 already.
Instead, you should create only the combinations that actually exist in a line and collect them into the map. This processes only the combinations that actually occur, and you can improve it further by counting only one of each “a_b”/“b_a” pair, as the probability of both is identical. Then you are performing only “number of lines”×“words per line”×“words per line” operations, in other words, roughly 16,000 operations in your case.
The following method builds all word combinations of a line, keeping only one of each “a_b”/“b_a” pair, and eliminates duplicates so each combination counts at most once per line.
static Stream<String> allCombinations(String line) {
    String[] words = line.split(" ");
    return Arrays.stream(words)
        .flatMap(word1 ->
            Arrays.stream(words)
                .filter(word2 -> word1.compareTo(word2) < 0)
                .map(word2 -> word1 + '_' + word2))
        .distinct();
}
This method can be used like
List<String> lines = Files.readAllLines(Paths.get("src/reviews-preprocessing.txt"));
double ratio = 1.0 / lines.size();
Map<String, Double> pij = lines.stream()
    .flatMap(line -> allCombinations(line))
    .collect(Collectors.groupingBy(Function.identity(),
        Collectors.summingDouble(x -> ratio)));
It ran through my copy of “War and Peace” within a few seconds, without needing any attempt at parallel processing. Not very surprisingly, “and_the” was the combination with the highest probability.
You may consider changing the line
String[] words = line.split(" ");
to
String[] words = line.toLowerCase().split("\\W+");
to generalize the code to work with different input, handling multiple spaces or other punctuation characters and ignoring case.
Our project recently updated to the newer Oracle.ManagedDataAccess DLLs (v 4.121.2.0), and this error has been cropping up intermittently. We've fixed it a few times without really knowing what we did to fix it.
I'm fairly certain it's caused by CLOB fields being mapped to strings in Entity Framework and then being selected in LINQ statements that pull entire entities instead of just a limited set of properties.
Error:
Value cannot be null.
Parameter name: byteArray
Stack Trace:
at System.BitConverter.ToString(Byte[] value, Int32 startIndex, Int32 length)
at OracleInternal.TTC.TTCLob.GetLobIdString(Byte[] lobLocator)
at OracleInternal.ServiceObjects.OracleDataReaderImpl.CollectTempLOBsToBeFreed(Int32 rowNumber)
at Oracle.ManagedDataAccess.Client.OracleDataReader.ProcessAnyTempLOBs(Int32 rowNumber)
at Oracle.ManagedDataAccess.Client.OracleDataReader.Read()
at System.Data.Entity.Core.Common.Internal.Materialization.Shaper`1.StoreRead()
Suspect Entity Properties:
'Mapped to Oracle CLOB Column'
<Column("LARGEFIELD")>
Public Property LargeField As String
But I'm confident this is the proper way to map the fields per Oracle's matrix:
ODP.NET Types Overview
There is nothing obviously wrong with the generated SQL statement either:
SELECT
...
"Extent1"."LARGEFIELD" AS "LARGEFIELD",
...
FROM ... "Extent1"
WHERE ...
I have also tried this Fluent code per Ozkan's suggestion, but it does not seem to affect my case.
modelBuilder.Entity(Of [CLASS])().Property(
    Function(x) x.LargeField
).IsOptional()
Troubleshooting Update:
After extensive testing, we are quite certain this is actually a bug, not a configuration problem. It appears to be the contents of the CLOB that cause the problem under a very specific set of circumstances. I've cross-posted this on the Oracle Forums, hoping for more information.
After installing the Oracle 12 client we ran into the same problem.
In machine.config (C:\Windows\Microsoft.NET\Framework\v4.0.30319\Config) I removed all entries with Oracle.ManagedDataAccess.
In the directory C:\Windows\Microsoft.NET\assembly\GAC_MSIL I removed both Oracle.ManagedDataAccess and Policy.4.121.Oracle.ManagedDataAccess.
Then my C# program started working as usual, using the Oracle.ManagedDataAccess DLL in its own directory.
We ran into this problem in our project an hour ago and found a solution. The error is caused by null values in a CLOB column. We have a CLOB column that is nullable in the database, but in the Entity Framework model the corresponding property is a non-nullable String. We changed the column's Nullable property to True in the EF model, and that fixed the problem.
We have this problem as well on some computers, and we are running the latest Oracle.ManagedDataAccess.dll (4.121.2.20150926 ODAC RELEASE 4).
We found a solution to our problem, and I just wanted to share.
This was our problem code, which failed on some computers.
Using connection As New OracleConnection(yourConnectionString)
    Dim command As New OracleCommand(yourQuery, connection)
    connection.Open()
    Using reader As OracleDataReader = command.ExecuteReader()
        Dim clobField As String = CStr(reader.Item("CLOB_FIELD"))
    End Using
    connection.Close()
End Using
And here's the solution that made it work on all computers.
Using connection As New OracleConnection(yourConnectionString)
    Dim command As New OracleCommand(yourQuery, connection)
    connection.Open()
    Using reader As OracleDataReader = command.ExecuteReader()
        Dim clobField As String = reader.GetOracleClob(0).Value
    End Using
    connection.Close()
End Using
I spent a lot of time trying to decipher this and found bits of this and that here and there on the internet, but nowhere had everything in one place, so I'd like to post what I've learned and how I resolved it. It's much like Ragowit's answer, but I've got the C# code for it.
Background
The error: So I had this error on my while (dr.Read()) line:
Value cannot be null. \r\nParameter name: byteArray
I ran into very little on the internet about this, except that it was an error with the CLOB field when it was null, and was supposedly fixed in the latest ODAC release, according to this: https://community.oracle.com/thread/3944924
My take on this -- NOT TRUE! It hasn't been updated since October 5, 2015 (http://www.oracle.com/technetwork/topics/dotnet/utilsoft-086879.html), and the 12c package I'm using was downloaded in April 2016.
Full stack trace by someone else with the error that pretty much mirrored mine: http://pastebin.com/24AfFDnq
Value cannot be null.
Parameter name: byteArray
at System.BitConverter.ToString(Byte[] value, Int32 startIndex, Int32 length)
at OracleInternal.TTC.TTCLob.GetLobIdString(Byte[] lobLocator)
at OracleInternal.ServiceObjects.OracleDataReaderImpl.CollectTempLOBsToBeFreed(Int32 rowNumber)
at Oracle.ManagedDataAccess.Client.OracleDataReader.ProcessAnyTempLOBs(Int32 rowNumber)
at Oracle.ManagedDataAccess.Client.OracleDataReader.Read()
at System.Data.Entity.Core.Common.Internal.Materialization.Shaper`1.StoreRead()
'Mapped to Oracle CLOB Column'
<Column("LARGEFIELD")>
Public Property LargeField As String
'Mapped to Oracle BLOB Column'
<Column("IMAGE")>
Public Property FileContents As Byte()
How I encountered it: It was while reading an 11-column table of about 3,000 rows. One of the columns was actually an NCLOB (so apparently this is just as susceptible as CLOB), which allowed nulls in the database, and some of its values were empty; it was an optional "Notes" field, after all. It's funny that I didn't get this error on the first or even second row that had an empty Notes field. It didn't error until row 768 finished and it was about to start row 769, according to an int counter variable I had set up (starting at 0), and after checking how many rows my DataTable had by then. I found I got the error if I used:
DataSet ds = new DataSet();
OracleDataAdapter adapter = new OracleDataAdapter(cmd);
adapter.Fill(ds);
as well as if I used:
DataTable dt = new DataTable();
OracleDataReader dr = cmd.ExecuteReader();
dt.Load(dr);
or if I used:
OracleDataReader dr = cmd.ExecuteReader();
if (dr.HasRows)
{
    while (dr.Read())
    {
        ....
    }
}
where cmd is the OracleCommand, so it made no difference.
Resolution
The following is basically the code I used to parse through an OracleDataReader's values in order to assign them to a DataTable. It's actually not as refined as it could be: I just return dr[i] to the DataRow in all cases except when the value is null, or when it is the eleventh column (index 10, because it starts at 0) and a particular query has been executed, so that I know where my NCLOB column is.
public static DataTable GetDataTableManually(string query)
{
    OracleConnection conn = null;
    DataSet ds = new DataSet(); // was used below but never declared in the original
    try
    {
        string connString = ConfigurationManager.ConnectionStrings["MyConn"].ConnectionString;
        conn = new OracleConnection(connString);
        OracleCommand cmd = new OracleCommand(query, conn);
        conn.Open();
        OracleDataReader dr = cmd.ExecuteReader(CommandBehavior.CloseConnection);
        DataTable dtSchema = dr.GetSchemaTable();
        DataTable dt = new DataTable();
        List<DataColumn> listCols = new List<DataColumn>();
        List<string> listTypes = new List<string>(); // was List<DataColumn>, but strings are stored in it
        if (dtSchema != null)
        {
            foreach (DataRow drow in dtSchema.Rows)
            {
                string columnName = System.Convert.ToString(drow["ColumnName"]);
                DataColumn column = new DataColumn(columnName, (Type)(drow["DataType"]));
                listCols.Add(column);
                listTypes.Add(drow["DataType"].ToString()); // necessary in order to record nulls
                dt.Columns.Add(column);
            }
        }

        // Read rows from the DataReader and populate the DataTable
        if (dr.HasRows)
        {
            int rowCount = 0;
            while (dr.Read())
            {
                string fieldType = String.Empty;
                DataRow dataRow = dt.NewRow();
                for (int i = 0; i < dr.FieldCount; i++)
                {
                    if (!dr.IsDBNull(i)) // was dr.IsDBNull[i], which does not compile
                    {
                        fieldType = dr.GetFieldType(i).ToString(); // example only, this is the same as listTypes[i], and neither helps us distinguish NCLOB from NVARCHAR2 - both will say System.String

                        // This is the magic
                        if (query == "SELECT * FROM Orders" && i == 10)
                            dataRow[((DataColumn)listCols[i])] = dr.GetOracleClob(i); // <-- our new check!!!!

                        // Found if you have null Decimal fields, this is
                        // also needed, and GetOracleDecimal and GetDecimal
                        // will not help you - only GetFloat does
                        else if (listTypes[i] == "System.Decimal")
                            dataRow[((DataColumn)listCols[i])] = dr.GetFloat(i);
                        else
                            dataRow[((DataColumn)listCols[i])] = dr[i];
                    }
                    else // value was null; we can't always assign dr[i] if DBNull, such as when it is a number or decimal field
                    {
                        byte[] nullArray = new byte[0];
                        switch (listTypes[i])
                        {
                            case "System.String": // includes NVARCHAR2, CLOB, NCLOB, etc.
                                dataRow[((DataColumn)listCols[i])] = String.Empty;
                                break;
                            case "System.Decimal":
                            case "System.Int16": // Boolean
                            case "System.Int32": // Number
                                dataRow[((DataColumn)listCols[i])] = 0;
                                break;
                            case "System.DateTime":
                                dataRow[((DataColumn)listCols[i])] = DBNull.Value;
                                break;
                            case "System.Byte[]": // Blob
                                dataRow[((DataColumn)listCols[i])] = nullArray;
                                break;
                            default:
                                dataRow[((DataColumn)listCols[i])] = String.Empty;
                                break;
                        }
                    }
                }
                dt.Rows.Add(dataRow);
            }
            ds.Tables.Add(dt);
        }
    }
    catch (Exception ex)
    {
        // handle error
    }
    finally
    {
        if (conn != null)
            conn.Close();
    }

    // After everything is closed
    if (ds.Tables.Count > 0)
        return ds.Tables[0]; // there should only be one table if we got results
    else
        return null;
}
In the same way that I have it assigning specific types of null based on the column type found in the schema-table loop, you could add conditions to the "not null" side of the if...then and make various GetOracle... calls there. I found it was only necessary for this NCLOB instance, though.
To give credit where credit is due, the original codebase is based on the answer given by sarathkumar at Populate data table from data reader.
For me it was simple! I had this error with ODAC v 4.121.1.0. I just updated Oracle.ManagedDataAccess to 4.121.2.0 with NuGet and now it is working.
Have you tried uninstalling and reinstalling Oracle.ManagedDataAccess with NuGet?
Upgrading Oracle.ManagedDataAccess.dll to version 4.122.1.0 solved it.
If you are using VS 2017, you can update via NuGet.
In my current ASP.NET MVC 3.0 project I am stuck in a situation.
I have four .txt files, each with approximately 100k rows of records.
These files will be replaced with new files on a weekly basis.
I need to query data from these four text files, and I am not able to choose the best and most efficient way to do this.
Three ways I can think of:
Convert the text files to XML on a weekly basis and query it with LINQ to XML
Run a weekly batch import from txt to SQL Server and query using LINQ to Entities
Avoid all conversions and query directly from the text files
Can anyone suggest the best way to deal with this situation?
Update:
(screenshot: URL of the text file)
I need to connect to this file with credentials.
Once I connect successfully, I will have the text file as below, with a pipe (|) as the delimiter.
(screenshot: the text file contents)
Now I have to look up the field highlighted in yellow and get the data in that row.
Note: the first two lines of the text file are the file's headers.
Well, I found a way myself. Hope this will be useful to anyone interested in getting this done.
string url = "https://myurl.com/folder/file.txt";
WebClient request = new WebClient();
request.Credentials = new NetworkCredential(ConfigurationManager.AppSettings["UserName"], ConfigurationManager.AppSettings["Password"]);
Stream s = request.OpenRead(url);
using (StreamReader strReader = new StreamReader(s))
{
    // Skip the two header lines
    for (int i = 0; i <= 1; i++)
        strReader.ReadLine();

    while (!strReader.EndOfStream)
    {
        var CurrentLine = strReader.ReadLine();
        var count = CurrentLine.Split('|').Count();
        if (count > 3 && CurrentLine.Split('|')[3].Equals("SearchString"))
        {
            #region Bind Data to Model
            //var Line = CurrentLine.Split('|');
            //CID.RecordType = Line[0];
            //CID.ChangeIdentifier = Line[1];
            //CID.CoverageID = Convert.ToInt32(Line[2]);
            //CID.NationalDrugCode = Line[3];
            //CID.DrugQualifier = Convert.ToInt32(Line[4]);
            #endregion
            break;
        }
    }
    s.Close();
}
request.Dispose();
This is a bit of a subjective question about a specific situation. My main goal with this question is to remind myself to code up the solution. However, if there is already a solution, or an alternate approach, I would like to know about it.
I'm working on a project and using Entity Framework 4 for database access. The database design is something I don't have control over. The database was designed many years ago, and in my opinion its design no longer fits the database's current purposes. This results in very complicated queries.
This is the first time I'm using Entity Framework in a project, but I have extensive experience in development against MS SQL Server.
What I find myself doing again and again is this:
I write a complex L2E query. The query is either slow or returns wrong results.
I look at my L2E query and have absolutely no idea how to improve it.
I fire up SQL Profiler and capture the SQL that EF generated from my query.
I want to execute part of that SQL to identify the part of the query that is giving problems.
The query comes through as sp_executesql with a dozen parameters, because if a parameter is used 3 times in a query, L2E creates 3 parameters and passes the same value to all of them. The same happens with every parameter.
Now I have to extract the SQL from sp_executesql, unescape all escaped apostrophes, and substitute every parameter in the query with its value.
After this is done I can finally run parts of the query and pinpoint the problem.
I go back to my L2E code and change it to fix the problem I found, and the cycle repeats.
To be honest, I'm starting to think that one should not use an ORM if one doesn't own the database design.
That aside, the process of unescaping the SQL and substituting the parameters is the one I want to automate. The goal is to get 'naked', de-parameterized SQL that I can run in SSMS.
This is a very simple example of what I see in the profiler and what I want to get as a result. My real cases are many times more complex.
The capture:
exec sp_executesql N'SELECT
[Extent1].[ProductName] AS [ProductName]
FROM [dbo].[Products] AS [Extent1]
INNER JOIN [dbo].[Categories] AS [Extent2] ON [Extent1].[CategoryID] = [Extent2].[CategoryID]
WHERE ([Extent1].[UnitPrice] > @p__linq__0) AND ([Extent2].[CategoryName] = @p__linq__1) AND (N''Chang'' <> [Extent1].[ProductName])',N'@p__linq__0 decimal(1,0),@p__linq__1 nvarchar(4000)',@p__linq__0=1,@p__linq__1=N'Beverages'
Desired result:
SELECT
[Extent1].[ProductName] AS [ProductName]
FROM [dbo].[Products] AS [Extent1]
INNER JOIN [dbo].[Categories] AS [Extent2] ON [Extent1].[CategoryID] = [Extent2].[CategoryID]
WHERE ([Extent1].[UnitPrice] > 1) AND ([Extent2].[CategoryName] = N'Beverages') AND (N'Chang' <> [Extent1].[ProductName])
I'm just going to write code to convert the likes of the first into the likes of the second if there is nothing better, and I'll post the solution here. But maybe it's already been done by someone? Or maybe there is a profiler or something that can give me SQL code I can execute partially in SSMS?
So here is what I ended up with. A couple of notes:
This won't work in 100% of cases, but it is good enough for me.
There is a lot to improve in terms of usability. Currently I put a shortcut to the compiled binary on the desktop, cut the text to convert to the clipboard, double-click the shortcut, and paste the result.
using System;
using System.Text.RegularExpressions;
using System.Windows.Forms;

namespace EFC
{
    static class Program
    {
        [STAThread]
        static void Main()
        {
            try
            {
                string input = Clipboard.GetText();
                const string header = "exec sp_executesql N'";
                CheckValidInput(input.StartsWith(header), "Input does not start with {0}", header);

                // Find the part of the statement that constitutes whatever sp_executesql has to execute
                int bodyStartIndex = header.Length;
                int bodyEndIndex = FindClosingApostroph(input, bodyStartIndex);
                CheckValidInput(bodyEndIndex > 0, "Unable to find closing \"'\" in the body");
                string body = input.Substring(bodyStartIndex, bodyEndIndex - bodyStartIndex);

                // Unescape 's
                body = body.Replace("''", "'");

                // Work out where the parameters are
                int blobEndIndex = FindClosingApostroph(input, bodyEndIndex + 4);
                CheckValidInput(blobEndIndex > 0, "Unable to find closing \"'\" in the params"); // was checking bodyEndIndex, a copy-paste bug
                string ps = input.Substring(blobEndIndex);

                // Right-to-left, so that @p__linq__2 does not get substituted inside @p__linq__20
                Regex regexEf = new Regex(@"(?<name>@p__linq__(?:\d+))=(?<value>(?:.+?)((?=,@p)|($)))", RegexOptions.RightToLeft);
                Regex regexLinqToSql = new Regex(@"(?<name>@p(?:\d+))=(?<value>(?:.+?)((?=,@p)|($)))", RegexOptions.RightToLeft);

                MatchCollection mcEf = regexEf.Matches(ps);
                MatchCollection mcLinqToSql = regexLinqToSql.Matches(ps);
                MatchCollection mc = mcEf.Count > 0 ? mcEf : mcLinqToSql;

                // Substitute parameters in the statement with their values
                foreach (Match m in mc)
                {
                    string name = m.Groups["name"].Value;
                    string value = m.Groups["value"].Value;
                    body = body.Replace(name, value);
                }

                Clipboard.SetText(body);
                MessageBox.Show("Done!", "CEF");
            }
            catch (ApplicationException ex)
            {
                MessageBox.Show(ex.Message, "Error");
            }
            catch (Exception ex)
            {
                MessageBox.Show(ex.Message, "Error");
                MessageBox.Show(ex.StackTrace, "Error");
            }
        }

        static int FindClosingApostroph(string input, int bodyStartIndex)
        {
            for (int i = bodyStartIndex; i < input.Length; i++)
            {
                // An apostrophe followed by another apostrophe is an escaped one; skip the pair
                if (input[i] == '\'' && i + 1 < input.Length)
                {
                    if (input[i + 1] != '\'')
                    {
                        return i;
                    }
                    i++;
                }
            }
            return -1;
        }

        static void CheckValidInput(bool isValid, string message, params object[] args)
        {
            if (!isValid)
            {
                throw new ApplicationException(string.Format(message, args));
            }
        }
    }
}
Well, maybe this will be helpful. MSVS 2010 has IntelliTrace. Every time EF makes a query there is an ADO.NET event with the query:
Execute Reader "SELECT TOP (1)
[Extent1].[id] AS [id],
[Extent1].[Sid] AS [Sid],
[Extent1].[Queue] AS [Queue],
[Extent1].[Extension] AS [Extension]
FROM [dbo].[Operators] AS [Extent1]
WHERE [Extent1].[Sid] = @p__linq__0" Command Text = "SELECT TOP (1) \r\n[Extent1].[id] AS [id], \r\n[Extent1].[Sid] AS [Sid], \r\n[Extent1].[Queue] AS [Queue], \r\n[Extent1].[Extension] AS [Extension]\r\nFROM [dbo].[Operators] AS [Extent1]\r\nWHERE [Extent1].[Sid] = @p__linq__0",
Connection String = "Data Source=paris;Initial Catalog=telephony;Integrated Security=True;MultipleActiveResultSets=True"