How to process a big file by comparing each line with all the remaining lines in the same file? - string-comparison

I have a CSV file with 500,000 records in it. The fields in the CSV file are as follows:
No, Name, Address
Now I want to compare the name and address from each record with the name and address of all remaining records.
I was doing it in the following way:
List<String> lines = new ArrayList<>();
BufferedReader firstbufferedReader = new BufferedReader(new FileReader(new File(pathname)));
while ((line = firstbufferedReader.readLine()) != null) {
    lines.add(line);
}
firstbufferedReader.close();
for (int i = 0; i < lines.size(); i++)
{
    csvReader = new CSVReader(new StringReader(lines.get(i)));
    csvReader = null;
    for (int j = i + 1; j < lines.size(); j++)
    {
        csvReader = new CSVReader(new StringReader(lines.get(j)));
        csvReader = null;
        application.linesToCompare(lines.get(i),lines.get(j));
    }
}
The linesToCompare function extracts the name and address from its two parameters and compares them. If two records match by 80% or more (based on name and address), I mark them as duplicates.
But this approach takes too much time to process the CSV file.
I want a faster approach, maybe some kind of map-reduce or anything else.
Thanks in advance

Part of the time goes to work that achieves nothing: for every line in the outer loop, and again for every line in the inner loop, you create a new CSVReader and immediately discard it.
The file itself only needs to be read once. Read it into your lines list and then compare the entries against each other directly, without the throwaway readers. Note that the nested loop still performs n(n-1)/2 comparisons, which is roughly 1.25 × 10^11 for 500,000 records, so linesToCompare itself has to be cheap.
Something like this might work for you:
List<String> lines = new ArrayList<>();
// Read the file once; try-with-resources closes the reader even on error.
try (BufferedReader reader = new BufferedReader(new FileReader(new File(pathname)))) {
    String line;
    while ((line = reader.readLine()) != null) {
        lines.add(line);
    }
}
// Compare each line only with the lines that follow it.
for (int i = 0; i < lines.size(); i++) {
    for (int j = i + 1; j < lines.size(); j++) {
        application.linesToCompare(lines.get(i), lines.get(j));
    }
}
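If the pairwise loop is still too slow, one common way to cut the number of comparisons is blocking: group the records by a cheap key and only run the expensive comparison on records that share a key. The following is a minimal sketch and not part of the original answer. The class BlockedCompare, its helper names, and the choice of key (the first three letters of the lower-cased name) are assumptions made for illustration. A key like this will miss duplicates whose names differ within those first letters, so it trades some recall for speed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BlockedCompare {
    /** Hypothetical blocking key: first three letters of the trimmed, lower-cased name. */
    static String blockKey(String name) {
        String n = name.trim().toLowerCase();
        return n.length() <= 3 ? n : n.substring(0, 3);
    }

    /** Groups record indexes by blocking key; each record is {no, name, address}. */
    static Map<String, List<Integer>> buildBlocks(List<String[]> records) {
        Map<String, List<Integer>> blocks = new HashMap<>();
        for (int i = 0; i < records.size(); i++) {
            String key = blockKey(records.get(i)[1]);
            blocks.computeIfAbsent(key, k -> new ArrayList<>()).add(i);
        }
        return blocks;
    }

    /** Returns the index pairs that share a block; only these need the full comparison. */
    static List<int[]> candidatePairs(List<String[]> records) {
        List<int[]> pairs = new ArrayList<>();
        for (List<Integer> block : buildBlocks(records).values()) {
            for (int a = 0; a < block.size(); a++) {
                for (int b = a + 1; b < block.size(); b++) {
                    pairs.add(new int[] {block.get(a), block.get(b)});
                }
            }
        }
        return pairs;
    }
}
```

Each candidate pair would then be handed to application.linesToCompare. For a file of this size it is probably better to call the comparison inside the inner loop than to build the whole pair list in memory first.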

Related

Why is the H2 MVStore off-heap store file not created?

I use the following code to try the MVStore off-heap store:
OffHeapStore offHeap = new OffHeapStore();
MVStore s = new MVStore.Builder().fileName("c:\\temp\\h2.cache").fileStore(offHeap).open();
int count = 100;
Map<Integer, String> map1 = s.openMap("u1");
for (int i = 0; i < count; i++) {
map1.put(i, "Hello " + i);
}
s.commit();
int size1 = map1.size();
s.close();
System.out.println("=====");
MVStore s2 = new MVStore.Builder().fileStore(offHeap).open();
map1 = s2.openMap("u1");
for (int i = 0; i < size1; i++) {
System.out.println("M1>"+i+","+map1.get(i));
}
s2.close();
The code seems to work. After the code executes, the file "c:\temp\h2.cache" is not created. Why?
OffHeapStore keeps its data in the memory of the process, outside the Java heap. When you pass a custom store with .fileStore(something), .fileName(something) is silently ignored.
If you want to store data on disk, remove the OffHeapStore initialization and both .fileStore(offHeap) calls, and add the missing .fileName("c:\\temp\\h2.cache") call to the second MVStore.Builder().

JDBC ResultSet to ArrayList, ArrayList to .txt file

I got the following piece of code that retrieves all rows in a table:
String MakeTXT = "USE SRO_VT_SHARD Select * from _RefTeleLink";
pst = conn.prepareStatement(MakeTXT);
rs = pst.executeQuery();
ArrayList<String> links = new ArrayList<>();
int i = 1;
String rows = "";
while (rs.next()) {
for (i = 1; i <= 22; i++) {
links.add(rs.getString(i));
if (i == 22) {
links.add("\n");
}
}
}
rows = String.join("\t", links);
System.out.println(rows);
}
}
What I want to do is:
Select all rows from the table. See result: prnt.sc/egbh4o
Write all selected rows to a .txt file
.txt file has to look something like this (literally copy pasted the rows): http://prntscr.com/egbhn4
What my code currently outputs:
output
It does this because there are 22 columns, and when the loop reaches 22, it adds a newline to the ArrayList.
What I'm actually looking for is a way to copy an entire row using ResultSet, instead of using a for loop to loop 22 times, and make a row of the 22 results.
Have looked everywhere but couldn't find anything.. :(
You do not need an ArrayList to hold the column values as they are read. I would rather use a StringBuilder as shown below, appending a tab after each column inside the loop and then replacing the last tab with a line feed.
String MakeTXT = "USE SRO_VT_SHARD Select * from _RefTeleLink";
Statement stm = conn.createStatement();
ResultSet rs = stm.executeQuery(MakeTXT);
List<String> rows = new ArrayList<>();
StringBuilder row = new StringBuilder();
ResultSetMetaData meta = rs.getMetaData();
final int colCount = meta.getColumnCount();
while (rs.next()) {
    row.setLength(0); // reuse the builder for each row
    // JDBC column indexes start at 1, not 0.
    for (int c = 1; c <= colCount; c++)
        row.append(rs.getString(c)).append("\t");
    // Replace the trailing tab with a line feed.
    row.setCharAt(row.length() - 1, '\n');
    rows.add(row.toString());
}
rs.close();
stm.close();
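The snippet above stops once the rows list is built, while the question also asks for a .txt file. A minimal sketch of that last step follows. The class RowFileWriter, the method writeRows, and the use of UTF-8 are assumptions for illustration and not part of the original answer. It relies on each row already ending with a line feed, as the loop above produces.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class RowFileWriter {
    /** Writes each row unchanged; rows are expected to end with '\n' already. */
    static void writeRows(Path target, List<String> rows) throws IOException {
        // try-with-resources flushes and closes the writer even if a write fails.
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String row : rows) {
                out.write(row);
            }
        }
    }
}
```

It could be called as RowFileWriter.writeRows(Paths.get("links.txt"), rows), where links.txt is a placeholder file name.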

Finding highest number in text file

I have a text file that contains 50 student names and a score for each student in the format:
foreName.Surname:Mark
I have figured out how to split up each line into a forename, surname and mark using this code.
string[] Lines = File.ReadAllLines(@"StudentExamMarks.txt");
int i = 0;
var items = from line in Lines
            where i++ != 0
            let words = line.Split(' ', '.', ':')
            select new
            {
                foreName = words[0],
                Surname = words[1],
                Mark = words[2]
            };
I am unsure how I would incorporate a findMax algorithm into this to find the highest mark and display the pupil who has it, as I have not used text files that often.
You could sort the list, but finding the maximum only needs a single pass that remembers the highest mark seen so far.
Try this code; it only needs to parse every line of the file once.
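string[] lines = File.ReadAllLines(@"StudentExamMarks.txt");
string maxForeName = null;
string maxSurName = null;
var maxMark = 0;
for (int i = 0; i < lines.Length; i++)
{
    var tmp = lines[i].Split(new char[] { ' ', '.', ':' }, StringSplitOptions.RemoveEmptyEntries);
    int value;
    // Skip lines that are not "foreName.Surname:Mark" with a numeric mark (e.g. a header line).
    if (tmp.Length == 3 && int.TryParse(tmp[2], out value))
    {
        // The first valid line always becomes the current maximum.
        if (maxForeName == null || value > maxMark)
        {
            maxMark = value;
            maxForeName = tmp[0];
            maxSurName = tmp[1];
        }
    }
}
// Display the pupil with the highest mark.
Console.WriteLine(maxForeName + " " + maxSurName + ": " + maxMark);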

In C# VS2013 how do you read a resource txt file one line at a time?

static void Starter(ref int[,] grid)
{
StreamReader reader = new StreamReader(Assembly.GetExecutingAssembly().GetManifestResourceStream(Resources.Sudoku));
string line = reader.ReadLine();
Console.Write(line);
Console.ReadLine();
}
I know this isn't right, but it gets my point across.
I would like to be able to read in the resource file one line at a time.
Like so:
System.IO.StreamReader StringFromTxt
= new System.IO.StreamReader(path);
string line = StringFromTxt.ReadLine();
I do not necessarily have to read in from the resource, but I am not sure of any other way to call a text file without knowing the directory every time, or hard coding it. I can't have the user pick files.
StreamReader sr = new StreamReader("D:\\CountryCodew.txt");
while (!sr.EndOfStream)
{
string line = sr.ReadLine();
}
MSDN lists the following as the way to read in one line at a time:
https://msdn.microsoft.com/en-us/library/aa287535(v=vs.71).aspx
int counter = 0; //keep track of #lines read
string line;
// Read the file and display it line by line.
System.IO.StreamReader file =
new System.IO.StreamReader("c:\\test.txt");
while((line = file.ReadLine()) != null)
{
Console.WriteLine (line);
counter++;
}
file.Close();
// Suspend the screen.
Console.ReadLine();
Additional examples for getline:
https://msdn.microsoft.com/en-us/library/2whx1zkx.aspx

Deleting Particular repeated field data from Google protocol buffer

.proto file structure
message repetedMSG
{
required string data = 1;
}
message mainMSG
{
required repetedMSG_id = 1;
repeated repetedMSG rptMSG = 2;
}
I have one mainMSG that contains many repetedMSG entries (say 10).
Now I want to delete one particular repetedMSG (say the 5th) from mainMSG. I tried three ways, but none of them worked.
for (int j = 0; j<mainMSG->repetedMSG_size(); j++){
repetedMSG reptMsg = mainMsg->mutable_repetedMSG(j);
if (QString::fromStdString(reptMsg->data).compare("deleteMe") == 0){
*First tried way:-* reptMsg->Clear();
*Second tried Way:-* delete reptMsg;
*Third tried way:-* reptMsg->clear_formula_name();
break;
}
}
I get a run-time error when I serialize mainMSG to write it to a file, i.e. when this line executes:
mainMSG.SerializeToOstream (std::fstream output("C:/A/test1", std::ios::out | std::ios::trunc | std::ios::binary))
You can use RepeatedPtrField::DeleteSubrange() for this. However, be careful about using this in a loop -- people commonly write code like this which is O(n^2):
// BAD CODE! O(n^2)!
for (int i = 0; i < message.foo_size(); i++) {
  if (should_filter(message.foo(i))) {
    message.mutable_foo()->DeleteSubrange(i, 1);
    --i;
  }
}
Instead, if you plan to remove multiple elements, do something like this:
// Move all filtered elements to the end of the list.
int keep = 0; // number to keep
for (int i = 0; i < message.foo_size(); i++) {
  if (should_filter(message.foo(i))) {
    // Skip.
  } else {
    if (keep < i) {
      message.mutable_foo()->SwapElements(i, keep);
    }
    ++keep;
  }
}
// Remove the filtered elements.
message.mutable_foo()->DeleteSubrange(keep, message.foo_size() - keep);
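The same swap-then-truncate idea can be tried out without protobuf. The sketch below applies it to a plain Java List so the single-pass behaviour is easy to check. It is an analogue for illustration only and does not use the protobuf API; the class CompactFilter and the method removeMatching are hypothetical names.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

final class CompactFilter {
    /** Removes every element matching shouldFilter in one pass, keeping the order of the rest. */
    static <T> void removeMatching(List<T> items, Predicate<? super T> shouldFilter) {
        int keep = 0; // number of elements kept so far
        for (int i = 0; i < items.size(); i++) {
            if (!shouldFilter.test(items.get(i))) {
                if (keep < i) {
                    // Move the kept element forward; a filtered one drifts toward the end.
                    Collections.swap(items, i, keep);
                }
                keep++;
            }
        }
        // Drop the filtered elements that have collected at the tail.
        items.subList(keep, items.size()).clear();
    }
}
```

The list passed in has to be mutable and resizable, for example an ArrayList.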
