I have a Windows Forms application that reads a text file containing tens of thousands of lines. I process each row in the file and save it to the database. The first rows are processed very quickly, but processing gets slower as the run continues, and memory usage appears to grow. I don't know what I'm missing. Any suggestions are welcome.
This is the code I use:
if (ofData.ShowDialog() == DialogResult.OK)
{
    //ofData => OpenFileDialog name
    string[] lines = File.ReadAllLines(ofData.FileName, Encoding.UTF8);
    foreach (string line in lines)
    {
        string[] res = line.Split('#');
        string opaqId = res[0].Trim();
        string name = res[1].Trim();
        string surName = res[2].Trim();
        Student s = new Student();
        s.opaq = opaqId;
        s.name = name;
        s.surname = surName;
        studentManager.Insert(s); //EntityFramework insert database
    }
}
ofData.Dispose();
I'm trying to take a set of unique strings and split it into disjoint groups by the following criterion: if two strings match in one or more columns, they belong to the same group.
For example
111;123;222
200;123;100
300;;100
All of them belong to one group because of these overlaps:
the first string matches the second on the value "123"
the second string matches the third on the value "100"
After getting these groups, I need to save them to a text file.
I have a 60 MB file of strings that must be sorted into groups (time limit: 30 seconds).
I think the best way is first to split the strings into columns and then look for matches, but I'm not sure at all.
Please help me find a solution.
For now, I have this code; it runs in about 2.5-3 seconds:
// getting from file
File file = new File(path);
InputStream inputFS = new FileInputStream(file);
BufferedReader br = new BufferedReader(new InputStreamReader(inputFS));
List<String> inputList = br.lines().collect(Collectors.toList());
br.close();
List<String> firstValues = new ArrayList<>();
List<String> secondValues = new ArrayList<>();
List<String> thirdValues = new ArrayList<>();
// extracting distinct values and splitting
final String qq = "\"";
inputList.stream()
.map(s -> s.split(";"))
.forEach(strings -> {
firstValues.add(strings.length > 0 ? strings[0].replaceAll(qq, "") : null);
secondValues.add(strings.length > 1 ? strings[1].replaceAll(qq, "") : null);
thirdValues.add(strings.length > 2 ? strings[2].replaceAll(qq, "") : null);
});
// todo: add to maps by the row and then find groups
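One possible way to approach the grouping described above is a union-find (disjoint-set) structure: remember which row first contained each (column, value) pair, and merge the current row with that row whenever the pair shows up again. The following is only a minimal Python sketch of that idea, not the asker's method. The names `find` and `group_rows` are made up for illustration, and it assumes that empty cells never count as a match and that values only match within the same column.

```python
from collections import defaultdict


def find(parent, i):
    """Return the root of i, compressing the path along the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def group_rows(rows):
    """Group rows that share a non-empty value in the same column."""
    rows = list(dict.fromkeys(rows))          # keep unique rows, preserve order
    parent = list(range(len(rows)))
    first_seen = {}                           # (column, value) -> row index
    for idx, row in enumerate(rows):
        for col, value in enumerate(row.split(";")):
            value = value.strip('"')
            if not value:                     # assumption: empty cells never match
                continue
            key = (col, value)
            if key in first_seen:
                a, b = find(parent, idx), find(parent, first_seen[key])
                parent[a] = b                 # merge the two groups
            else:
                first_seen[key] = idx
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[find(parent, idx)].append(row)
    return list(groups.values())
```

Each row is visited once and each lookup is a dictionary access, so the work grows roughly linearly with the number of cells, which is the property that matters for a 60 MB input.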
I am reading a big text file using Java. The file has 5,000,000 rows with 3 columns each, and its size is 350 MB.
For each row, I read it, create an object using Criteria on Maven, and store it in a PostgreSQL database with a session.saveOrUpdate(object) command.
In the database I have a table with a serial ID and three attributes, where I store the three columns of the file.
At the beginning the process runs "fast" (35,000 records in 30 min), but it keeps getting slower and the time to finish grows exponentially. How can I improve the process?
I have tried splitting the big file into several smaller files, but that is, if anything, slower.
Many thanks in advance!
P.S. The code:
public void process(){
File archivo = null;
FileReader fr = null;
BufferedReader br = null;
String linea;
String [] columna;
try{
archivo = new File ("/home/josealopez/Escritorio/file.txt");
fr = new FileReader (archivo);
br = new BufferedReader(fr);
while((linea=br.readLine())!=null){
columna = linea.split(";");
saveIntoBBDD(columna[0],columna[1],columna[2]);
}
}
catch(Exception e){
e.printStackTrace();
}
finally{
try{
if( null != fr ){
fr.close();
}
}
catch (Exception e2){
e2.printStackTrace();
}
}
}
@CommitAfter
public void saveIntoBBDD(String lon, String lat, String met){
Object b = new Object();
b.setLon(Double.parseDouble(lon));
b.setLat(Double.parseDouble(lat));
b.setMeters(Double.parseDouble(met));
session.saveOrUpdate(b);
}
Line-by-line processing is your problem here; you should run this as a bulk load instead. PostgreSQL has a built-in command for bulk file loading, COPY, which can handle comma-separated and tab-separated files. The delimiter, quote character and many other settings are customizable.
Please check the official PostgreSQL documentation on populating a database, as well as the details of the COPY command.
In this answer I provided a small example of how I do this kind of thing.
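As a rough illustration of the COPY approach, here is a minimal Python sketch that streams a semicolon-separated file to the server in a single COPY statement. It assumes `cursor` is a psycopg2 cursor (whose `copy_expert` method accepts a SQL string and a file object). The table name `points` and the column names `lon`, `lat`, `meters` are hypothetical, since the question does not give the real ones.

```python
def bulk_load(cursor, path, table="points", columns=("lon", "lat", "meters")):
    """Stream a semicolon-separated file into PostgreSQL with COPY.

    `cursor` is expected to be a psycopg2 cursor. `table` and `columns`
    are hypothetical names, not taken from the question, and must come
    from trusted code because they are interpolated into the SQL text.
    """
    sql = "COPY {} ({}) FROM STDIN WITH (FORMAT csv, DELIMITER ';')".format(
        table, ", ".join(columns)
    )
    with open(path, "r", encoding="utf-8") as source:
        cursor.copy_expert(sql, source)   # the whole file goes in one statement
    return sql
```

The caller is still responsible for committing the transaction on the connection afterwards. The serial ID column is left out of the column list on purpose so that the database fills it in.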
I have an array of strings from a log file with the following format:
var lines = new []
{
"--------",
"TimeStamp: 12:45",
"Message: Message #1",
"--------",
"--------",
"TimeStamp: 12:54",
"Message: Message #2",
"--------",
"--------",
"Message: Message #3",
"TimeStamp: 12:55",
"--------"
};
I want to group each set of lines (as delimited by "--------") into a list using LINQ. Basically, I want a List<List<string>> or similar where each inner list contains 4 strings - 2 separators, a timestamp and a message.
I should add that I would like to make this as generic as possible, as the log-file format could change.
Can this be done?
Will this work?
var result = Enumerable.Range(0, lines.Length / 4)
    .Select(l => lines.Skip(l * 4).Take(4).ToList())
    .ToList();
EDIT:
This looks a little hacky but I'm sure it can be cleaned up
IEnumerable<List<String>> GetLogGroups(string[] lines)
{
var list = new List<String>();
foreach (var line in lines)
{
list.Add(line);
if (list.Count(l => l.All(c => c == '-')) == 2)
{
yield return list;
list = new List<string>();
}
}
}
You should be able to do better than returning a List<List<string>>. If you're using C# 4, you could project each set of values into a dynamic type where the string before the colon becomes the property name and the string after it becomes the value. You then create a custom iterator that reads lines until the closing "--------" of each set appears and then yield returns that row. On MoveNext, you read the next set of lines. Rinse and repeat until EOF. I don't have time at the moment to write up a full implementation, but my sample on reading CSV and using LINQ over the dynamic objects may give you an idea of what you can do. See http://www.thinqlinq.com/Post.aspx/Title/LINQ-to-CSV-using-DynamicObject. (Note that this sample is in VB, but the same can be done in C# with some modifications.)
The iterator implementation has the added benefit of not having to load the entire document into memory before parsing. With this version, you only hold one block of lines in memory at a time, which lets you handle really large files.
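The iterator idea can be sketched in a few lines. The following Python generator is only an illustration of the technique, not the C# or VB implementation the answer refers to. The name `iter_log_records` is made up, and it assumes every block has both an opening and a closing separator line, as in the sample data.

```python
def iter_log_records(lines, separator="--------"):
    """Yield one dict per block delimited by a pair of separator lines."""
    record = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line == separator:
            if record is None:
                record = {}               # opening separator starts a block
            else:
                yield record              # closing separator ends the block
                record = None
        elif record is not None and ":" in line:
            # Text before the first colon is the key, the rest is the value.
            key, value = line.split(":", 1)
            record[key.strip()] = value.strip()
```

Because `lines` can be any iterable, passing an open file object reads the log lazily, one block at a time, and new keys in the log format need no code changes.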
Assuming that your structure is always
delimiter
TimeStamp
Message
delimiter
public List<List<String>> ConvertLog(String[] log)
{
    var LogSet = new List<List<String>>();
    for (int i = 0; i < log.Length; i += 4)
    {
        // Only take a group when all four lines are present.
        if (i + 3 < log.Length)
        {
            var set = new List<String>() { log[i], log[i+1], log[i+2], log[i+3] };
            LogSet.Add(set);
        }
    }
    return LogSet;
}
Or in LINQ:
public List<List<String>> ConvertLog(String[] log)
{
    return Enumerable.Range(0, log.Length / 4)
        .Select(l => log.Skip(l * 4).Take(4).ToList())
        .ToList();
}
I have a bunch of names in alphabetical order, with multiple instances of the same name grouped together. Beside each name, after a comma, I have a role that has been assigned to it, one name-role pair per line, something like what's shown below:
name1,role1
name1,role2
name1,role3
name1,role8
name2,role8
name2,role2
name2,role4
name3,role1
name4,role5
name4,role1
...
..
.
I am looking for an algorithm that takes the above .csv file as input and creates an output .csv file in the following format:
name1,role1,role2,role3,role8
name2,role8,role2,role4
name3,role1
name4,role5,role1
...
..
.
So basically I want each name to appear only once and then the roles to be printed in csv format next to the names for all names and roles in the input file.
The algorithm should be language independent. I would appreciate it if it does NOT use OOP principles :-) I am a newbie.
Obviously has some formatting bugs but this will get you started.
var lastName = "";
do{
var name = readName();
var role = readRole();
if(lastName!=name){
print("\n"+name+",");
lastName = name;
}
print(role+",");
}while(reader.isReady());
This is easy to do if your language has associative arrays: arrays that can be indexed by anything (such as a string) rather than just numbers. Some languages call them "hashes," "maps," or "dictionaries."
On the other hand, if you can guarantee that the names are grouped together as in your sample data, Stefan's solution works quite well.
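The associative-array approach described above can be sketched as follows. This is a minimal Python illustration using a plain dict and no classes. The function name `group_roles` is made up, and it relies on dicts preserving insertion order (Python 3.7 and later), so names come out in the order they first appear.

```python
def group_roles(lines):
    """Collect roles per name and return one CSV line per name."""
    roles_by_name = {}
    for line in lines:
        line = line.strip()
        if not line:                      # skip blank lines
            continue
        name, role = line.split(",", 1)
        roles_by_name.setdefault(name, []).append(role)
    return [",".join([name] + roles) for name, roles in roles_by_name.items()]
```

Unlike the last-name comparison, this version does not depend on the input being grouped, at the cost of holding all names and roles in memory until the end.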
It's kind of a pity you said it had to be language-agnostic, because Python is rather well qualified for this:
import itertools
def split(s):
    return s.strip().split(',', 1)
with open(filename, 'r') as f:
    for name, lines in itertools.groupby(f, lambda s: split(s)[0]):
        print(name + ',' + ','.join(split(s)[1] for s in lines))
Basically the groupby call takes all consecutive lines with the same name and groups them together.
Now that I think about it, though, Stefan's answer is probably more efficient.
Here is a solution in Java:
Scanner sc = new Scanner (new File(fileName));
Map<String, List<String>> nameRoles = new HashMap<String, List<String>> ();
while (sc.hasNextLine()) {
String line = sc.nextLine();
String args[] = line.split (",");
if (nameRoles.containsKey(args[0])) {
nameRoles.get(args[0]).add(args[1]);
} else {
List<String> roles = new ArrayList<String>();
roles.add(args[1]);
nameRoles.put(args[0], roles);
}
}
// then print it out
for (String name : nameRoles.keySet()) {
List<String> roles = nameRoles.get(name);
System.out.print(name + ",");
for (String role : roles) {
System.out.print(role + ",");
}
System.out.println();
}
With this approach, you can also handle unsorted input such as:
name1,role1
name3,role1
name2,role8
name1,role2
name2,role2
name4,role5
name4,role1
Here it is in C# using nothing fancy. It should be self-explanatory:
static void Main(string[] args)
{
using (StreamReader file = new StreamReader("input.txt"))
{
string prevName = "";
while (!file.EndOfStream)
{
string line = file.ReadLine(); // read a line
string[] tokens = line.Split(','); // split the name and the parameter
string name = tokens[0]; // this is the name
string param = tokens[1]; // this is the parameter
if (name == prevName) // if the name is the same as the previous name we read, we add the current param to that name. This works right because the names are sorted.
{
Console.Write(param + " ");
}
else // otherwise, we are definitely done with the previous name, and have printed all of its parameters (due to the sorting).
{
if (prevName != "") // make sure we don't print an extra newline the first time around
{
Console.WriteLine();
}
Console.Write(name + ": " + param + " "); // write the name followed by the first parameter. The output format can easily be tweaked to print commas.
prevName = name; // store the new name as the previous name.
}
}
}
}
I have a file that represents items: each item has one line with the item GUID, followed by 5 lines describing the item.
Example:
Line 1: Guid=8e2803d1-444a-4893-a23d-d3b4ba51baee name= line1
Line 2: Item details = bla bla
.
.
Line 7: Guid=79e5e39d-0c17-42aa-a7c4-c5fa9bfe7309 name= line7
Line 8: Item details = bla bla
.
.
I first want to read this file to get the GUIDs of the items that meet given criteria, using LINQ, e.g. where line.Contains("line1"). This gives me the whole line, from which I extract the GUID. I then want to pass this GUID to another function, which should access the file again, find that line (where line.Contains("line1") && line.Contains("8e2803d1-444a-4893-a23d-d3b4ba51baee")) and read the next 5 lines starting from that line.
Is there an efficient way to do this?
I don't think it really makes sense to use LINQ for all of this, given what you need to do and given that the index of the line in the array is fairly integral. I would also recommend doing everything in one pass: opening the file multiple times won't be as efficient as reading everything once and processing it immediately. As long as the file is structured as regularly as you describe, this won't be terribly difficult:
private Dictionary<Guid, String[]> GetStuff()
{
    var lines = File.ReadAllLines("foo.txt");
    var result = new Dictionary<Guid, String[]>();
    for (var index = 0; index < lines.Length; index += 6)
    {
        var item = new
        {
            // Assumes the header line holds only the GUID; otherwise extract it from the line first.
            Guid = new Guid(lines[index]),
            Description = lines.Skip(index + 1).Take(5).ToArray()
        };
        result.Add(item.Guid, item.Description);
    }
    return result;
}
I tried a couple of different ways to do this with LINQ, but none of them let me do a single scan of the file. For the scenario you're describing, I would go down to the enumerator level and use GetEnumerator like this:
public IEnumerable<LogData> GetLogData(string filename)
{
var line1Regex = @"Line\s(\d+):\sGuid=([0123456789abcdefg]{8}-[0123456789abcdefg]{4}-[0123456789abcdefg]{4}-[0123456789abcdefg]{4}-[0123456789abcdefg]{12})\sname=\s(\w*)";
int detailLines = 4;
var lines = File.ReadAllLines(filename).GetEnumerator();
while (lines.MoveNext())
{
var line = (string)lines.Current;
var match = Regex.Match(line, line1Regex);
if (!match.Success)
continue;
var details = new string[detailLines];
for (int i = 0; i < detailLines && lines.MoveNext(); i++)
{
details[i] = (string)lines.Current;
}
yield return new LogData
{
Id = new Guid(match.Groups[2].Value),
Name = match.Groups[3].Value,
LineNumber = int.Parse(match.Groups[1].Value),
Details = details
};
}
}
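The same single-pass idea (find a header line, then pull the following detail lines from the same enumerator) can be sketched in Python. This is only an illustration under assumptions: the names `GUID_LINE` and `iter_items` are made up, the header format is taken from the example lines in the question, and `detail_count` defaults to the 5 description lines the question mentions.

```python
import re

# Header pattern based on the sample lines: "Guid=<guid> name= <name>".
GUID_LINE = re.compile(
    r"Guid=([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})\s+name=\s*(\w*)"
)


def iter_items(lines, detail_count=5):
    """Yield (guid, name, details) for each header line found in `lines`."""
    it = iter(lines)
    for line in it:
        match = GUID_LINE.search(line)
        if match is None:
            continue                      # not a header line, keep scanning
        details = []
        for _ in range(detail_count):
            nxt = next(it, None)          # advance the same iterator
            if nxt is None:               # input ended early
                break
            details.append(nxt.rstrip("\n"))
        yield match.group(1), match.group(2), details
```

Filtering then happens on the yielded tuples, for example `[d for g, name, d in iter_items(f) if name == "line1"]`, so the file is read only once.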