Suggestions for finding and reporting duplicate lines in CSV [closed] - supercsv

I have multiple projects containing varying numbers of CSV files, for which I'm using the SuperCSV CsvBeanReader to perform the mapping and cell validation. I have created a bean per CSV file and overridden equals, hashCode and toString for each bean.
I am looking for suggestions on the best "all project" implementation method for identifying duplicate CSV lines: reporting (not removing) the original line number and line content, as well as the line number and line content of all duplicate lines found. Some of the files can reach hundreds of thousands of lines and over a GB in size; I wish to minimize the number of reads per file, and thought this could be accomplished while CsvBeanReader has the file open.
Thank you in advance.

Given the size of your files and the fact that you want the line content of the original and duplicates, I think the best you can do is 2 passes over the file.
If you only wanted the latest line content for a duplicate, you could get away with 1 pass. Keeping track of the line content for the original plus all duplicates in 1 pass means you'd have to store the content of every row - you'd probably run out of memory.
My solution assumes two beans with the same hashCode() are duplicates. If you have to use equals() then it gets more complicated.
Pass 1: identify the duplicates (record the row numbers for each duplicate hash)
Pass 2: report on the duplicates

Pass 1: Identify duplicates
/**
 * Finds the row numbers with duplicate records (using the bean's hashCode()
 * method). The key of the returned map is the hashCode and the value is the
 * Set of duplicate row numbers for that hashcode.
 *
 * @param reader
 *            the reader
 * @param preference
 *            the preferences
 * @param beanClass
 *            the bean class
 * @param processors
 *            the cell processors
 * @return the map of duplicate rows (by hashcode)
 * @throws IOException
 */
private static Map<Integer, Set<Integer>> findDuplicates(
        final Reader reader, final CsvPreference preference,
        final Class<?> beanClass, final CellProcessor[] processors)
        throws IOException {
    ICsvBeanReader beanReader = null;
    try {
        beanReader = new CsvBeanReader(reader, preference);
        final String[] header = beanReader.getHeader(true);

        // the hashes of any duplicates
        final Set<Integer> duplicateHashes = new HashSet<Integer>();

        // the hashes for each row
        final Map<Integer, Set<Integer>> rowNumbersByHash =
            new HashMap<Integer, Set<Integer>>();

        Object o;
        while ((o = beanReader.read(beanClass, header, processors)) != null) {
            final Integer hashCode = o.hashCode();

            // get the row no's for the hash (create if required)
            Set<Integer> rowNumbers = rowNumbersByHash.get(hashCode);
            if (rowNumbers == null) {
                rowNumbers = new HashSet<Integer>();
                rowNumbersByHash.put(hashCode, rowNumbers);
            }

            // add the current row number to its hash
            final Integer rowNumber = beanReader.getRowNumber();
            rowNumbers.add(rowNumber);

            if (rowNumbers.size() == 2) {
                duplicateHashes.add(hashCode);
            }
        }

        // create a new map with just the duplicates
        final Map<Integer, Set<Integer>> duplicateRowNumbersByHash =
            new HashMap<Integer, Set<Integer>>();
        for (Integer duplicateHash : duplicateHashes) {
            duplicateRowNumbersByHash.put(duplicateHash,
                rowNumbersByHash.get(duplicateHash));
        }
        return duplicateRowNumbersByHash;

    } finally {
        if (beanReader != null) {
            beanReader.close();
        }
    }
}
As an alternative to this method, you could use a CsvListReader and make use of getUntokenizedRow().hashCode() - this calculates a hash based on the raw CSV String (it would be a lot faster, but your data may have subtle differences that would make that approach unsuitable).
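If the raw-line approach does suit your data, pass 1 could look something like the following sketch (my own illustration, not from the original answer - the method name and the final filtering step are assumptions):

// extra imports needed: org.supercsv.io.CsvListReader, org.supercsv.io.ICsvListReader
/**
 * Sketch: finds duplicate row numbers keyed by the hashCode() of the raw
 * (untokenized) CSV line, so no bean mapping or cell processing is needed.
 */
private static Map<Integer, Set<Integer>> findDuplicatesByRawLine(
        final Reader reader, final CsvPreference preference)
        throws IOException {
    final Map<Integer, Set<Integer>> rowNumbersByHash =
        new HashMap<Integer, Set<Integer>>();
    ICsvListReader listReader = null;
    try {
        listReader = new CsvListReader(reader, preference);
        listReader.getHeader(true); // skip the header
        while (listReader.read() != null) {
            // hash the raw line, e.g. "1,two,01/02/2013"
            final Integer hash = listReader.getUntokenizedRow().hashCode();
            Set<Integer> rowNumbers = rowNumbersByHash.get(hash);
            if (rowNumbers == null) {
                rowNumbers = new HashSet<Integer>();
                rowNumbersByHash.put(hash, rowNumbers);
            }
            rowNumbers.add(listReader.getRowNumber());
        }
    } finally {
        if (listReader != null) {
            listReader.close();
        }
    }
    // keep only the hashes that occurred more than once
    final Iterator<Set<Integer>> it = rowNumbersByHash.values().iterator();
    while (it.hasNext()) {
        if (it.next().size() < 2) {
            it.remove();
        }
    }
    return rowNumbersByHash;
}

A matching pass 2 would then look up getUntokenizedRow().hashCode() instead of the bean's hashCode().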
Pass 2: Report on duplicates
This method takes the output of the previous method and uses it to quickly identify each duplicate record and the other rows it duplicates.
/**
 * Reports the details of duplicate records.
 *
 * @param reader
 *            the reader
 * @param preference
 *            the preferences
 * @param beanClass
 *            the bean class
 * @param processors
 *            the cell processors
 * @param duplicateRowNumbersByHash
 *            the row numbers of duplicate records
 * @throws IOException
 */
private static void reportDuplicates(final Reader reader,
        final CsvPreference preference, final Class<?> beanClass,
        final CellProcessor[] processors,
        final Map<Integer, Set<Integer>> duplicateRowNumbersByHash)
        throws IOException {
    ICsvBeanReader beanReader = null;
    try {
        beanReader = new CsvBeanReader(reader, preference);
        final String[] header = beanReader.getHeader(true);
        Object o;
        while ((o = beanReader.read(beanClass, header, processors)) != null) {
            final Set<Integer> duplicateRowNumbers =
                duplicateRowNumbersByHash.get(o.hashCode());
            if (duplicateRowNumbers != null) {
                System.out.println(String.format(
                    "row %d is a duplicate of rows %s, line content: %s",
                    beanReader.getRowNumber(),
                    duplicateRowNumbers,
                    beanReader.getUntokenizedRow()));
            }
        }
    } finally {
        if (beanReader != null) {
            beanReader.close();
        }
    }
}
Sample
Here's an example of how the 2 methods are used.
// rows (2,4,8) and (3,7) are duplicates
private static final String CSV = "a,b,c\n" + "1,two,01/02/2013\n"
        + "2,two,01/02/2013\n" + "1,two,01/02/2013\n"
        + "3,three,01/02/2013\n" + "4,four,01/02/2013\n"
        + "2,two,01/02/2013\n" + "1,two,01/02/2013\n";

private static final CellProcessor[] PROCESSORS = { new ParseInt(),
        new NotNull(), new ParseDate("dd/MM/yyyy") };

public static void main(String[] args) throws IOException {
    final Map<Integer, Set<Integer>> duplicateRowNumbersByHash = findDuplicates(
            new StringReader(CSV), CsvPreference.STANDARD_PREFERENCE,
            Bean.class, PROCESSORS);
    reportDuplicates(new StringReader(CSV),
            CsvPreference.STANDARD_PREFERENCE, Bean.class, PROCESSORS,
            duplicateRowNumbersByHash);
}
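For reference, the sample assumes a Bean class along these lines (a sketch; the original answer doesn't show it). The field names match the header a,b,c, the types match the cell processors, and hashCode()/equals() cover all three properties:

import java.util.Date;

public class Bean {
    private int a;     // populated by ParseInt
    private String b;  // populated by NotNull
    private Date c;    // populated by ParseDate

    public int getA() { return a; }
    public void setA(int a) { this.a = a; }
    public String getB() { return b; }
    public void setB(String b) { this.b = b; }
    public Date getC() { return c; }
    public void setC(Date c) { this.c = c; }

    @Override
    public int hashCode() {
        int hash = 17;
        hash = hash * 31 + a;
        hash = hash * 31 + (b == null ? 0 : b.hashCode());
        hash = hash * 31 + (c == null ? 0 : c.hashCode());
        return hash;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof Bean)) return false;
        final Bean other = (Bean) obj;
        return a == other.a
                && (b == null ? other.b == null : b.equals(other.b))
                && (c == null ? other.c == null : c.equals(other.c));
    }
}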
Output:
row 2 is a duplicate of rows [2, 4, 8], line content: 1,two,01/02/2013
row 3 is a duplicate of rows [3, 7], line content: 2,two,01/02/2013
row 4 is a duplicate of rows [2, 4, 8], line content: 1,two,01/02/2013
row 7 is a duplicate of rows [3, 7], line content: 2,two,01/02/2013
row 8 is a duplicate of rows [2, 4, 8], line content: 1,two,01/02/2013

Related

How to get new user input in a stream while it's running using Java 8

I need to validate user input and, if it doesn't meet the conditions, replace it with correct input. So far I am stuck on two parts. I'm fairly new to Java 8 and not so familiar with all the libraries, so if you can give me advice on where to read up more on these I would appreciate it.
List<String> input = Arrays.asList(args);
List<String> validatedinput = input.stream()
    .filter(p -> {
        if (p.matches("[0-9, /,]+")) {
            return true;
        }
        System.out.println("The value has to be positive number and not a character");
        // Does the new input actually get saved here?
        sc.nextLine();
        return false;
    }) // And here I am not really sure how to map the String object
    .map(String::)
    .validatedinput(Collectors.toList());
This type of logic shouldn't be done with streams; a while loop is a much better candidate for it.
First, let's partition the data into two lists, one representing the valid inputs and the other representing the invalid inputs:
Map<Boolean, List<String>> resultSet = Arrays.stream(args)
        .collect(Collectors.partitioningBy(s -> s.matches(yourRegex),
                Collectors.toCollection(ArrayList::new)));
Then create the while loop to ask the user to correct all their invalid inputs:
int i = 0;
List<String> invalidInputs = resultSet.get(false);
final int size = invalidInputs.size();
while (i < size) {
    System.out.println("The value --> " + invalidInputs.get(i)
            + " has to be a positive number and not a character");
    String temp = sc.nextLine();
    if (temp.matches(yourRegex)) {
        resultSet.get(true).add(temp);
        i++;
    }
}
Now, you can collect the list of all the valid inputs and do what you like with it:
List<String> result = resultSet.get(true);
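Here's how the pieces might fit together as a runnable whole (a sketch; the regex, the Scanner and the class name are assumptions filled in for illustration):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.stream.Collectors;

public class InputValidator {
    public static void main(String[] args) {
        final String yourRegex = "[0-9]+"; // assumption: positive integers only
        final Scanner sc = new Scanner(System.in);

        // partition args into valid (true) and invalid (false) inputs
        Map<Boolean, List<String>> resultSet = Arrays.stream(args)
                .collect(Collectors.partitioningBy(s -> s.matches(yourRegex),
                        Collectors.toCollection(ArrayList::new)));

        // re-prompt until every invalid input has been replaced
        int i = 0;
        List<String> invalidInputs = resultSet.get(false);
        final int size = invalidInputs.size();
        while (i < size) {
            System.out.println("The value --> " + invalidInputs.get(i)
                    + " has to be a positive number and not a character");
            String temp = sc.nextLine();
            if (temp.matches(yourRegex)) {
                resultSet.get(true).add(temp);
                i++;
            }
        }

        List<String> result = resultSet.get(true);
        System.out.println("Valid inputs: " + result);
    }
}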

The probability distribution of two words in a file using Java 8

I need the number of lines that contain two given words. For this purpose I have written the following code:
The input file contains 1,000 lines and about 4,000 words, and it takes about 4 hours.
Is there a library in Java that can do it faster?
Can I implement this code using Apache Lucene or Stanford CoreNLP to achieve a shorter run time?
ArrayList<String> reviews = new ArrayList<String>();
ArrayList<String> terms = new ArrayList<String>();
Map<String, Double> pij = new HashMap<String, Double>();

BufferedReader br = null;
FileReader fr = null;
try {
    fr = new FileReader("src/reviews-preprocessing.txt");
    br = new BufferedReader(fr);
    String line;
    while ((line = br.readLine()) != null) {
        for (String term : line.split(" ")) {
            if (!terms.contains(term))
                terms.add(term);
        }
        reviews.add(line);
    }
} catch (IOException e) {
    e.printStackTrace();
} finally {
    try {
        if (br != null)
            br.close();
        if (fr != null)
            fr.close();
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}

long Count = reviews.size();
for (String term_i : terms) {
    for (String term_j : terms) {
        if (!term_i.equals(term_j)) {
            double p = (double) reviews.parallelStream()
                    .filter(s -> s.contains(term_i) && s.contains(term_j))
                    .count();
            String key = String.format("%s_%s", term_i, term_j);
            pij.put(key, p / Count);
        }
    }
}
Your first loop, which collects the distinct words, relies on ArrayList.contains, which has linear time complexity, instead of using a Set. So if we assume n distinct words, it already has a time complexity of “number of lines”×n.
Then, you are creating n×n word combinations and probing all 1,000 lines for the presence of each combination. In other words, if we assume just 100 distinct words, you are performing 1,000×100 + 100×100×1,000 = 10,100,000 operations; if we assume 500 distinct words, we're talking about 250,500,000 already.
Instead, you should create only the combinations that actually exist in a line and collect them into the map. This processes only those combinations that actually occur, and you can improve it further by checking only one of each “a_b”/“b_a” pair, as the probability of both is identical. Then you are only performing “number of lines”דwords per line”דwords per line” operations, in other words, roughly 16,000 operations in your case.
The following method combines all words of a line, keeping only one of each “a_b”/“b_a” pair, and eliminates duplicates so each combination counts at most once per line.
static Stream<String> allCombinations(String line) {
    String[] words = line.split(" ");
    return Arrays.stream(words)
            .flatMap(word1 -> Arrays.stream(words)
                    .filter(word2 -> word1.compareTo(word2) < 0)
                    .map(word2 -> word1 + '_' + word2))
            .distinct();
}
This method can be used like this:
List<String> lines = Files.readAllLines(Paths.get("src/reviews-preprocessing.txt"));
double ratio = 1.0 / lines.size();
Map<String, Double> pij = lines.stream()
        .flatMap(line -> allCombinations(line))
        .collect(Collectors.groupingBy(Function.identity(),
                Collectors.summingDouble(x -> ratio)));
It ran through my copy of “War and Peace” within a few seconds, without any attempt at parallel processing. Not very surprisingly, “and_the” was the combination with the highest probability.
You may consider changing the line
String[] words = line.split(" ");
to
String[] words = line.toLowerCase().split("\\W+");
to generalize the code to work with different input, handling multiple spaces or other punctuation characters and ignoring case.

Sorting a large amount of strings and grouping

I'm trying to take a set of unique strings and break it up into disjoint groups by this criterion: if two strings coincide in one or more columns, they belong to the same group.
For example
111;123;222
200;123;100
300;;100
All of them belong to one group, because of the overlaps:
the first string with the second via the "123" value
the second string with the third via the "100" value
After getting these groups, I need to save them to a text file.
I have a 60 MB file of such strings, which should be sorted (time limit: 30 seconds).
My first thought is that the best way is to split the strings into columns and then look for coincidences, but I'm not sure at all.
Please help me find a solution.
For now, I have this code; it runs in about 2.5-3 seconds:
// getting from file
File file = new File(path);
InputStream inputFS = new FileInputStream(file);
BufferedReader br = new BufferedReader(new InputStreamReader(inputFS));
List<String> inputList = br.lines().collect(Collectors.toList());
br.close();

List<String> firstValues = new ArrayList<>();
List<String> secondValues = new ArrayList<>();
List<String> thirdValues = new ArrayList<>();

// extracting distinct values and splitting
final String qq = "\"";
inputList.stream()
        .map(s -> s.split(";"))
        .forEach(strings -> {
            firstValues.add(strings.length > 0 ? strings[0].replaceAll(qq, "") : null);
            secondValues.add(strings.length > 1 ? strings[1].replaceAll(qq, "") : null);
            thirdValues.add(strings.length > 2 ? strings[2].replaceAll(qq, "") : null);
        });
// todo: add to maps by the row and then find groups
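One way to complete that todo (a sketch of my own, not from the original thread; the class and method names are illustrative) is a union-find over row indices: for each column, remember the first row seen with each value, and unite any later row sharing that value with it.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RowGrouper {

    private final int[] parent;

    RowGrouper(int n) {
        parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
    }

    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving
            x = parent[x];
        }
        return x;
    }

    void union(int a, int b) {
        parent[find(a)] = find(b);
    }

    /** Groups row indices that share a non-empty value in the same column. */
    static Map<Integer, List<Integer>> group(List<String[]> rows, int columns) {
        RowGrouper uf = new RowGrouper(rows.size());
        for (int col = 0; col < columns; col++) {
            Map<String, Integer> firstRowWithValue = new HashMap<>();
            for (int row = 0; row < rows.size(); row++) {
                String[] cells = rows.get(row);
                if (col >= cells.length || cells[col].isEmpty()) continue;
                Integer first = firstRowWithValue.putIfAbsent(cells[col], row);
                if (first != null) uf.union(row, first);
            }
        }
        // collect each row under its group's root index
        Map<Integer, List<Integer>> groups = new HashMap<>();
        for (int row = 0; row < rows.size(); row++) {
            groups.computeIfAbsent(uf.find(row), k -> new ArrayList<>()).add(row);
        }
        return groups;
    }
}

For the example above, the second column unites the first two rows via "123" and the third column unites the last two via "100", so all three rows end up in one group.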

Most efficient way to determine if there are any differences between specific properties of 2 lists of items?

In C# .NET 4.0, I am struggling to come up with the most efficient way to determine if the contents of 2 lists of items contain any differences.
I don't need to know what the differences are, just true/false whether the lists are different based on my criteria.
The 2 lists I am trying to compare contain FileInfo objects, and I want to compare only the FileInfo.Name and FileInfo.LastWriteTimeUtc properties of each item. All the FileInfo items are for files located in the same directory, so the FileInfo.Name values will be unique.
To summarize, I am looking for a single Boolean result for the following criteria:
Does ListA contain any items with FileInfo.Name not in ListB?
Does ListB contain any items with FileInfo.Name not in ListA?
For items with the same FileInfo.Name in both lists, are the FileInfo.LastWriteTimeUtc values different?
Thank you,
Kyle
I would use a custom IEqualityComparer<FileInfo> for this task:
public class FileNameAndLastWriteTimeUtcComparer : IEqualityComparer<FileInfo>
{
    public bool Equals(FileInfo x, FileInfo y)
    {
        if (Object.ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.FullName.Equals(y.FullName) && x.LastWriteTimeUtc.Equals(y.LastWriteTimeUtc);
    }

    public int GetHashCode(FileInfo fi)
    {
        unchecked // Overflow is fine, just wrap
        {
            int hash = 17;
            hash = hash * 23 + fi.FullName.GetHashCode();
            hash = hash * 23 + fi.LastWriteTimeUtc.GetHashCode();
            return hash;
        }
    }
}
Now you can use a HashSet<FileInfo> with this comparer and HashSet<T>.SetEquals:
var comparer = new FileNameAndLastWriteTimeUtcComparer();
var uniqueFiles1 = new HashSet<FileInfo>(list1, comparer);
bool anyDifferences = !uniqueFiles1.SetEquals(list2);
Note that I've used FileInfo.FullName instead of Name, since names alone aren't unique at all.
Sidenote: another advantage is that you can use this comparer for many LINQ methods like GroupBy, Except, Intersect or Distinct.
This is not the most efficient way (probably ranks a 4 out of 5 in the quick-and-dirty category):
// Note: project only the properties being compared; including the FileInfo
// reference itself would make every anonymous instance compare unequal.
var comparableListA = ListA.Select(a =>
    new { Name = a.Name, LastWrite = a.LastWriteTimeUtc });
var comparableListB = ListB.Select(b =>
    new { Name = b.Name, LastWrite = b.LastWriteTimeUtc });

// Note: Except is one-directional; to catch items present only in ListB,
// also check comparableListB.Except(comparableListA).
var diffList = comparableListA.Except(comparableListB);
var youHaveDiff = diffList.Any();
Explanation:
Anonymous types are compared by property values, which is exactly what you're looking to do; that's what led me to think of a LINQ projection along these lines.
P.S.
You should double-check the syntax; I just rattled this off without the compiler.

How to add two fields in MongoDB using the Java driver

db.student.aggregate([{$project: {rollno: 1, per: {$divide: [{$add: ["$marks1", "$marks2", "$marks3"]}, 3]}}}])
How do I write this query in Java? Here, student is a collection with fields rollno, name and marks, and I have to find the percentage for each student according to their roll number. I am not able to write the code for adding their marks, as the $add operator does not seem to support multiple values for addition.
This seems to work. There are other builder patterns and conveniences, but this exposes all the workings and leaves room for rich, dynamic construction.
DBCollection coll = db.getCollection("student");
List<DBObject> pipe = new ArrayList<DBObject>();

/*
 * Clearly, lots of room for dynamic behavior here.
 * Different sets of marks, etc. And the divisor is
 * the length of these, etc.
 */
String[] marks = new String[]{"$marks1", "$marks2", "$marks3"};
DBObject add = new BasicDBObject("$add", marks);

List<Object> l2 = new ArrayList<Object>();
l2.add(add);
l2.add(marks.length); // 3
DBObject divide = new BasicDBObject("$divide", l2);

DBObject prjflds = new BasicDBObject();
prjflds.put("rollno", 1);
prjflds.put("per", divide);

DBObject project = new BasicDBObject();
project.put("$project", prjflds);
pipe.add(project);

AggregationOutput agg = coll.aggregate(pipe);
for (DBObject result : agg.results()) {
    System.out.println(result);
}
