Find / Count Redundant Records in a List<T> - linq

I am looking for a way to identify duplicate records...only I want / expect to see them.
So the records aren't duplicated completely but the unique fields I am unconcerned with at this point. I just want to see if they have made X# payments of the exact same amount, via the exact same card, to the exact same person. (Bogus example just to illustrate)
The collection is a List<>, and whatever X# is, List<>.Count will be X#. In other words, either all the records in the list match (again, on just the fields I am concerned with) or I will reject the list.
The best I can come up with is to take the first record get value of say PayAmount and LINQ the other two to see if they have the same PayAmount value. Repeat for all fields to be matched. This seems horribly inefficient but I am not smart enough to think of a better way.
So any thoughts, ideas, pointers would be greatly appreciated.
JB

Something like this should do it.
var duplicates = list.GroupBy(x => new { x.Amount, x.CardNumber, x.PersonName })
.Where(x => x.Count() > 1);

Working example:
class Program
{
static void Main(string[] args)
{
List<Entry> table = new List<Entry>();
var dup1 = new Entry
{
Name = "David",
CardNumber = 123456789,
PaymentAmount = 70.00M
};
var dup2 = new Entry
{
Name = "Daniel",
CardNumber = 987654321,
PaymentAmount = 45.00M
};
//3 duplicates
table.Add(dup1);
table.Add(dup1);
table.Add(dup1);
//2 duplicates
table.Add(dup2);
table.Add(dup2);
//Find duplicates query
var query = from p in table
group p by new { p.Name, p.CardNumber, p.PaymentAmount } into g
where g.Count() > 1
select new
{
name = g.Key.Name,
cardNumber = g.Key.CardNumber,
amount = g.Key.PaymentAmount,
count = g.Count()
};
foreach (var item in query)
{
Console.WriteLine("{0}, {1}, {2}, {3}", item.name, item.cardNumber, item.amount, item.count);
}
Console.ReadKey();
}
}
public class Entry
{
public string Name { get; set; }
public int CardNumber { get; set; }
public decimal PaymentAmount { get; set; }
}
The meat of which is this:
var query = from p in table
group p by new { p.Name, p.CardNumber, p.PaymentAmount } into g
where g.Count() > 1
select new
{
name = g.Key.Name,
cardNumber = g.Key.CardNumber,
amount = g.Key.PaymentAmount,
count = g.Count()
};
Your unique entries are based on the three criteria of Name, Card Number, and Payment Amount, so you group by them and then use .Count() to count how many records share each combination. where g.Count() > 1 filters the groups down to duplicates only.
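The question also asks for an accept/reject decision: does every record in the list match on the chosen fields? A minimal sketch of that check, assuming a hypothetical Payment class (the class, its property names, and the helper name AllMatchOnKey are illustrative, not taken from the question):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type; property names are assumptions for illustration.
public class Payment
{
    public string PersonName { get; set; }
    public int CardNumber { get; set; }
    public decimal Amount { get; set; }
    public DateTime PaidOn { get; set; } // a field that is deliberately not compared
}

public static class PaymentChecks
{
    // True when the list is non-empty and every record shares the same
    // (PersonName, CardNumber, Amount) combination.
    public static bool AllMatchOnKey(IEnumerable<Payment> payments)
    {
        return payments
            // Anonymous types compare by value, so Distinct works on them directly.
            .Select(p => new { p.PersonName, p.CardNumber, p.Amount })
            .Distinct()
            .Count() == 1;
    }
}
```

Calling PaymentChecks.AllMatchOnKey(list) returns false for an empty list, which may or may not be the behaviour you want for the reject case.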

Related

Comparing two lists with multiple conditions

I have two different lists of same type. I wanted to compare both lists and need to get the values which are not matched.
List of class:
public class pre
{
public int id {get; set;}
public datetime date {get; set;}
public int sID {get; set;}
}
Two lists :
List<pre> pre1 = new List<pre>();
List<pre> pre2 = new List<pre>();
Query which I wrote to get the unmatched values:
var preResult = pre1.where(p1 => !pre
.any(p2 => p2.id == p1.id && p2.date == p1.date && p2.sID == p1sID));
But the result is wrong here. I am getting all the values in pre1.
Here is a solution:
class Program
{
static void Main(string[] args)
{
var pre1 = new List<pre>()
{
new pre {id = 1, date =DateTime.Now.Date, sID=1 },
new pre {id = 7, date = DateTime.Now.Date, sID = 2 },
new pre {id = 9, date = DateTime.Now.Date, sID = 3 },
new pre {id = 13, date = DateTime.Now.Date, sID = 4 },
// ... etc ...
};
var pre2 = new List<pre>()
{
new pre {id = 1, date =DateTime.Now.Date, sID=1 },
// ... etc ...
};
var preResult = pre1.Where(p1 => !pre2.Any(p2 => p2.id == p1.id && p2.date == p1.date && p2.sID == p1.sID)).ToList();
Console.ReadKey();
}
}
Note: the date property contains only the date; the time part will be 00:00:00.
I fixed some typos and tested your code with sensible values, and your code would correctly select unmatched records. As prabhakaran S's answer mentions, perhaps your date values include time components that differ. You will need to check your data and decide how to proceed.
However, a better way to select unmatched records from one list compared against another would be to utilize a left join technique common to working with relational databases, which you can also do in Linq against in-memory collections. It will scale better as the sizes of your inputs grow.
var preResult = from p1 in pre1
join p2 in pre2
on new { p1.id, p1.date, p1.sID }
equals new { p2.id, p2.date, p2.sID } into grp
from item in grp.DefaultIfEmpty()
where item == null
select p1;
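A different technique that also avoids rescanning the second list for every element is to put the second list's keys into a HashSet and test membership. This is only a sketch: the class Pre mirrors the question's pre class, and the names UnmatchedFinder and FindUnmatched are hypothetical. It compares date.Date, on the assumption that differing time components should be ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Mirrors the question's class; the capitalised type name is an assumption.
public class Pre
{
    public int id { get; set; }
    public DateTime date { get; set; }
    public int sID { get; set; }
}

public static class UnmatchedFinder
{
    // Returns the items of 'first' that have no counterpart in 'second'
    // with the same id, calendar date, and sID.
    public static List<Pre> FindUnmatched(IEnumerable<Pre> first, IEnumerable<Pre> second)
    {
        // Value tuples compare by value, so they can serve as composite keys.
        var seen = new HashSet<(int, DateTime, int)>(
            second.Select(p => (p.id, p.date.Date, p.sID)));

        return first
            .Where(p => !seen.Contains((p.id, p.date.Date, p.sID)))
            .ToList();
    }
}
```

If the time of day is significant in your data, compare p.date instead of p.date.Date.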

enumerable group field using Linq?

I've written a Linq sentence like this:
var fs = list
.GroupBy(i =>
new {
X = i.X,
Ps = i.Properties.Where(p => p.Key.Equals("m")) <<<<<<<<<<<
}
)
.Select(g => g.Key);
Am I able to group by IEnumerable.Where(...) fields?
The grouping won't work here.
When grouping, the runtime compares group keys in order to produce the groups. However, the group key here contains a property (Ps) that is a distinct IEnumerable<T> instance for each item in list, and the comparison is made on reference equality, not on sequence equality. Every element therefore ends up with its own key; in other words, if you have two items:
var a = new { X = 1, Properties = new[] { "m" } };
var b = new { X = 1, Properties = new[] { "m" } };
The GroupBy clause will give you two distinct keys.
If your intent is to just project the items into the structure of the GroupBy key then you don't need the grouping; the query below should give the same result:
var fs = list.Select(item => new
{
item.X,
Ps = item.Properties.Where(p => p.Key == "m")
});
However, if you do require the results to be distinct, you'll need to create a separate class for your result and implement a separate IEqualityComparer<T> to be used with Distinct clause:
public class Result
{
public int X { get; set; }
public IEnumerable<string> Ps { get; set; }
}
public class ResultComparer : IEqualityComparer<Result>
{
public bool Equals(Result a, Result b)
{
return a.X == b.X && a.Ps.SequenceEqual(b.Ps);
}
// Implement GetHashCode
}
Having the above you can use Distinct on the first query to get distinct results:
var fs = list.Select(item => new Result
{
X = item.X,
Ps = item.Properties.Where( p => p.Key == "m")
}).Distinct(new ResultComparer());
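The comparer above leaves GetHashCode unimplemented. One way to fill it in is sketched below; the hash formula is an assumption, and the only real requirement is that two results which Equals treats as equal produce the same hash code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Result
{
    public int X { get; set; }
    public IEnumerable<string> Ps { get; set; }
}

public class ResultComparer : IEqualityComparer<Result>
{
    public bool Equals(Result a, Result b)
    {
        if (ReferenceEquals(a, b)) return true;
        if (a is null || b is null) return false;
        return a.X == b.X && a.Ps.SequenceEqual(b.Ps);
    }

    public int GetHashCode(Result r)
    {
        // Combine X with every element of Ps, in order, so that results
        // considered equal by Equals always hash to the same value.
        unchecked
        {
            int hash = 17 * 31 + r.X;
            foreach (var p in r.Ps)
            {
                hash = hash * 31 + (p == null ? 0 : p.GetHashCode());
            }
            return hash;
        }
    }
}
```

Note that both Equals and GetHashCode enumerate Ps, so a lazily evaluated Where query is re-run on each call; materialising it with ToList() first avoids the repeated work.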

Linq join two lists: is it more efficient to use Dictionary?

Final rephrase
Below I join two sequences and I wondered if it would be faster to create a Dictionary of one sequence with the keySelector of the join as key and iterate through the other collection and find the key in the dictionary.
This only works if the key selector is unique. A real join has no problem with two records having the same key, whereas a dictionary requires unique keys.
I measured the difference and found that the dictionary method is about 13% faster, which is negligible in most use cases. See my answer to this question.
Rephrased question
Some suggested that this question is the same question as LINQ - Using where or join - Performance difference?, but this one is not about using where or join, but about using a Dictionary to perform the join.
My question is: if I want to join two sequences based on a key selector, which method would be faster?
Put all items of one sequence in a Dictionary and enumerate the other sequence to see if the item is in the Dictionary. This would mean to iterate through both sequences once and calculate hash codes on the keySelector for every item in both sequences once.
The other method: use System.Linq.Enumerable.Join.
The question is: Would Enumerable.Join for each element in the first list iterate through the elements in the second list to find a match according to the key selector, having to compare N * N elements (is this called second order?) or would it use a more advanced method?
Original question with examples
I have two classes, both with a property Reference. I have two sequences of these classes and I want to join them based on equal Reference.
class ClassA
{
public string Reference {get; set;}
...
}
public class ClassB
{
public string Reference {get; set;}
...
}
var listA = new List<ClassA>()
{
new ClassA() {Reference = "1", ...},
new ClassA() {Reference = "2", ...},
new ClassA() {Reference = "3", ...},
new ClassA() {Reference = "4", ...},
};
var listB = new List<ClassB>()
{
new ClassB() {Reference = "1", ...},
new ClassB() {Reference = "3", ...},
new ClassB() {Reference = "5", ...},
new ClassB() {Reference = "7", ...},
};
After the join I want combinations of ClassA objects and ClassB objects that have an equal Reference. This is quite simple to do:
var myJoin = listA.Join(listB, // join listA and listB
a => a.Reference, // from listA take Reference
b => b.Reference, // from listB take Reference
(objectA, objectB) => // if references equal
new {A = objectA, B = objectB}); // return combination
I'm not sure how this works, but I can imagine that for each a in listA the listB is iterated to see if there is a b in listB with the same reference as A.
Question: if I know that the references are Distinct wouldn't it be more efficient to convert B into a Dictionary and compare the Reference for each element in listA:
var dictB = listB.ToDictionary(b => b.Reference);
var myJoin = listA
.Where(a => dictB.ContainsKey(a.Reference))
.Select(a => new { A = a, B = dictB[a.Reference] });
This way, every element of listB has to be accessed once to put it in the dictionary, every element of listA has to be accessed once, and the hash code of each Reference has to be calculated once.
Would this method be faster for large collections?
I created a test program for this and measured the time it took.
Suppose I have a class Person; each person has a Name and a Father property of type Person. If the father is not known, the Father property is null.
I have a sequence of Bastards (no father) that each have exactly one son and one daughter. All daughters are put in one sequence. All sons are put in another sequence.
The query: join the sons and the daughters that have the same father.
Results: Joining 1 million families using Enumerable.Join took 1.169 sec. Joining them using Dictionary join used 1.024 sec. Ever so slightly faster.
The code:
class Person : IEquatable<Person>
{
public string Name { get; set; }
public Person Father { get; set; }
// + a lot of equality functions get hash code etc
// for those interested: see the bottom
}
const int nrOfBastards = 1000000; // one million
var bastards = Enumerable.Range (0, nrOfBastards)
.Select(i => new Person()
{ Name = 'B' + i.ToString(), Father = null })
.ToList();
var sons = bastards.Select(father => new Person()
{Name = "Son of " + father.Name, Father = father})
.ToList();
var daughters = bastards.Select(father => new Person()
{Name = "Daughter of " + father.Name, Father = father})
.ToList();
// join on same parent: Traditionally and using Dictionary
var stopwatch = Stopwatch.StartNew();
this.TraditionalJoin(sons, daughters);
var time = stopwatch.Elapsed;
Console.WriteLine("Traditional join of {0} sons and daughters took {1:F3} sec", nrOfBastards, time.TotalSeconds);
stopwatch.Restart();
this.DictionaryJoin(sons, daughters);
time = stopwatch.Elapsed;
Console.WriteLine("Dictionary join of {0} sons and daughters took {1:F3} sec", nrOfBastards, time.TotalSeconds);
}
private void TraditionalJoin(IEnumerable<Person> boys, IEnumerable<Person> girls)
{ // join on same parent
var family = boys
.Join(girls,
boy => boy.Father,
girl => girl.Father,
(boy, girl) => new { Son = boy.Name, Daughter = girl.Name })
.ToList();
}
private void DictionaryJoin(IEnumerable<Person> sons, IEnumerable<Person> daughters)
{
var sonsDictionary = sons.ToDictionary(son => son.Father);
var family = daughters
.Where(daughter => sonsDictionary.ContainsKey(daughter.Father))
.Select(daughter => new { Son = sonsDictionary[daughter.Father], Daughter = daughter })
.ToList();
}
For those interested in the equality of Persons, needed for a proper dictionary:
class Person : IEquatable<Person>
{
public string Name { get; set; }
public Person Father { get; set; }
public bool Equals(Person other)
{
if (other == null)
return false;
else if (Object.ReferenceEquals(this, other))
return true;
else if (this.GetType() != other.GetType())
return false;
else
return String.Equals(this.Name, other.Name, StringComparison.OrdinalIgnoreCase);
}
public override bool Equals(object obj)
{
return this.Equals(obj as Person);
}
public override int GetHashCode()
{
// Hash only what Equals compares (Name, ignoring case), so that
// two Persons considered equal always produce the same hash code.
return StringComparer.OrdinalIgnoreCase.GetHashCode(this.Name ?? String.Empty);
}
public override string ToString()
{
return this.Name;
}
public static bool operator==(Person x, Person y)
{
if (Object.ReferenceEquals(x, null))
return Object.ReferenceEquals(y, null);
else
return x.Equals(y);
}
public static bool operator!=(Person x, Person y)
{
return !(x==y);
}
}
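The dictionary join above only works when keys are unique. As far as I know, Enumerable.Join in LINQ to Objects itself builds a hash-based lookup of the inner sequence rather than comparing N * N pairs, which would explain why the measured gap is small. A minimal sketch of a hand-rolled join that tolerates duplicate keys, using ToLookup (the names JoinSketch and LookupJoin are hypothetical):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class JoinSketch
{
    // Pairs every left item with every right item that has the same key.
    // Unlike a Dictionary, a lookup may hold several items per key.
    public static List<(TA A, TB B)> LookupJoin<TA, TB, TKey>(
        IEnumerable<TA> left,
        IEnumerable<TB> right,
        Func<TA, TKey> leftKey,
        Func<TB, TKey> rightKey)
    {
        // One pass over 'right' to build the hash-based lookup.
        var lookup = right.ToLookup(rightKey);

        // One pass over 'left'; indexing a lookup with a missing key
        // yields an empty sequence, so unmatched items simply drop out.
        return left
            .SelectMany(a => lookup[leftKey(a)], (a, b) => (A: a, B: b))
            .ToList();
    }
}
```

For the benchmark above this would be called as JoinSketch.LookupJoin(sons, daughters, s => s.Father, d => d.Father).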

Linq to CSV select by column value

I know I have asked this question in a different manner earlier today but I have refined my needs a little better.
Given the following csv file, where the first column is the title and there could be any number of columns:
year,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
income,1000,1500,2000,2100,2100,2100,2100,2100,2100,2100
dividends,100,200,300,300,300,300,300,300,300,300
net profit,1100,1700,2300,2400,2400,2400,2400,2400,2400,2400
expenses,500,600,500,400,400,400,400,400,400,400
profit,600,1100,1800,2000,2000,2000,2000,2000,2000,2000
How do I select the profit value for a given year? So I may provide a year of say 2011 and expect to get the profit value of 2000 back.
At the moment I have this, which shows the profit value for each year, but ideally I'd like to specify the year and get the profit value:
var data = File.ReadAllLines(fileName)
.Select(
l => {
var split = l.Split(",".ToCharArray());
return split;
}
);
var profit = (from p in data where p[0] == profitFieldName select p).SingleOrDefault();
var years = (from p in data where p[0] == yearFieldName select p).FirstOrDefault();
int columnCount = years.Count() ;
for (int t = 1; t < columnCount; t++)
Console.WriteLine("{0} : ${1}", years[t], profit[t]);
I've already answered this once today, but this answer is a little more fleshed out and hopefully clearer.
string rowName = "profit";
string year = "2011";
var yearRow = data.First();
var yearIndex = Array.IndexOf(yearRow, year);
// get your 'profits' row, or whatever row you want
var row = data.Single(d => d[0] == rowName);
// return the appropriate index for that row.
return row[yearIndex];
This works for me.
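The snippet above can be wrapped into a small helper that also copes with a year or row that is not present (Array.IndexOf returns -1 in that case). This is a sketch only; the names CsvRows and ValueFor are hypothetical, and it assumes the title of the header row is "year" as in the sample file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CsvRows
{
    // Returns the cell in row 'rowName' under the column headed 'year',
    // or null when the row or the year cannot be found.
    public static string ValueFor(IEnumerable<string> lines, string rowName, string year)
    {
        var rows = lines.Select(l => l.Split(',')).ToList();

        // The header row is the one whose title column is "year".
        var yearRow = rows.FirstOrDefault(r => r[0] == "year");
        if (yearRow == null) return null;

        // Index 0 is the title column, so a valid year sits at index 1 or later.
        int yearIndex = Array.IndexOf(yearRow, year);
        if (yearIndex < 1) return null;

        var row = rows.SingleOrDefault(r => r[0] == rowName);
        if (row == null || yearIndex >= row.Length) return null;

        return row[yearIndex];
    }
}
```

With the sample file, CsvRows.ValueFor(File.ReadAllLines(fileName), "profit", "2011") returns "2000".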
You have an unfortunate data format, but I think the best thing to do is just to define a class, create a list, and then use your inputs to create objects to add to the list. Then you can do whatever querying you need to get your desired results.
class MyData
{
public string Year { get; set; }
public decimal Income { get; set; }
public decimal Dividends { get; set; }
public decimal NetProfit { get; set; }
public decimal Expenses { get; set; }
public decimal Profit { get; set; }
}
// ...
string dataFile = @"C:\Temp\data.txt";
List<MyData> list = new List<MyData>();
using (StreamReader reader = new StreamReader(dataFile))
{
string[] years = reader.ReadLine().Split(',');
string[] incomes = reader.ReadLine().Split(',');
string[] dividends = reader.ReadLine().Split(',');
string[] netProfits = reader.ReadLine().Split(',');
string[] expenses = reader.ReadLine().Split(',');
string[] profits = reader.ReadLine().Split(',');
for (int i = 1; i < years.Length; i++) // index 0 is a title
{
MyData myData = new MyData();
myData.Year = years[i];
myData.Income = decimal.Parse(incomes[i]);
myData.Dividends = decimal.Parse(dividends[i]);
myData.NetProfit = decimal.Parse(netProfits[i]);
myData.Expenses = decimal.Parse(expenses[i]);
myData.Profit = decimal.Parse(profits[i]);
list.Add(myData);
}
}
// query for whatever data you need
decimal maxProfit = list.Max(data => data.Profit);

GroupBy String and Count in LINQ

I have got a collection. The coll has strings:
Location="Theater=1, Name=regal, Area=Area1"
Location="Theater=34, Name=Karm, Area=Area4445"
and so on. I have to extract just the Name bit from the string. For example, here I have to extract the text 'regal' and group the query. Then display the result as
Name=regal Count 33
Name=Karm Count 22
I am struggling with the query:
Collection.Location.GroupBy(????);(what to add here)
Which is the shortest and most precise way to do it?
Yet another Linq + Regex approach:
string[] Location = {
"Theater=2, Name=regal, Area=Area1",
"Theater=2, Name=regal, Area=Area1",
"Theater=34, Name=Karm, Area=Area4445"
};
var test = Location.Select(
x => Regex.Match(x, "^.*Name=(.*),.*$")
.Groups[1].Value)
.GroupBy(x => x)
.Select(x=> new {Name = x.Key, Count = x.Count()});
Once you've extracted the string, just group by it and count the results:
var query = from location in locations
let name = ExtractNameFromLocation(location)
group 1 by name into grouped
select new { Name=grouped.Key, Count=grouped.Count() };
That's not particularly efficient, however: it has to do all the grouping before it does any counting. Have a look at this VJ article for an extension method for LINQ to Objects, and this one about Push LINQ, which is a somewhat different way of looking at LINQ.
EDIT: ExtractNameFromLocation would be the code taken from answers to your other question, e.g.
public static string ExtractNameFromLocation(string location)
{
var name = (from part in location.Split(',')
let pair = part.Split('=')
where pair[0].Trim() == "Name"
select pair[1].Trim()).FirstOrDefault();
return name;
}
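If building the groups before counting is a concern, the counts can be accumulated in a single pass with a Dictionary instead. This is a plain-loop sketch, not the extension method from the linked article; the names NameCounter and CountByName are hypothetical, and ExtractName repeats the logic of ExtractNameFromLocation above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NameCounter
{
    // Same idea as ExtractNameFromLocation: find the "Name=..." part.
    public static string ExtractName(string location)
    {
        return (from part in location.Split(',')
                let pair = part.Split('=')
                where pair[0].Trim() == "Name"
                select pair[1].Trim()).FirstOrDefault();
    }

    // Counts occurrences of each name in one pass, without materialising groups.
    public static Dictionary<string, int> CountByName(IEnumerable<string> locations)
    {
        var counts = new Dictionary<string, int>();
        foreach (var location in locations)
        {
            // Locations without a Name part are counted under the empty string.
            var name = ExtractName(location) ?? "";
            counts.TryGetValue(name, out int current); // current is 0 when absent
            counts[name] = current + 1;
        }
        return counts;
    }
}
```

Each entry of the returned dictionary can then be printed in the requested "Name=regal Count 33" form.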
Here is another LINQ alternative solution with a working example.
static void Main(string[] args)
{
System.Collections.Generic.List<string> l = new List<string>();
l.Add("Theater=1, Name=regal, Area=Area"); l.Add("Theater=34, Name=Karm, Area=Area4445");
foreach (IGrouping<string, string> g in l.GroupBy(r => extractName(r)))
{
Console.WriteLine( string.Format("Name= {0} Count {1}", g.Key, g.Count()) );
}
}
private static string extractName(string dirty)
{
System.Text.RegularExpressions.Match m =
System.Text.RegularExpressions.Regex.Match(
dirty, @"(?<=Name=)[a-zA-Z0-9_ ]+(?=,)");
return m.Success ? m.Value : "";
}

Resources