How to optimize this linq to objects query? - linq

I'm matching some in memory lists entities with a .contains (subselect) query to filter out old from new users.
Checking for performance problems i saw this:
The oldList mostly has around 1000 users in them, while the new list varies from 100 to 500. Is there a way to optimize this query?

Absolutely - build a set instead of checking a list each time:
// Change string to whatever the type of UserID is.
var oldUserSet = new HashSet<string>(oldList.Select(o => o.UserID));
var newUsers = NewList.Where(n => !oldUserSet.Contains(n.UserID))
.ToList();
The containment check on a HashSet should be O(1) assuming few hash collisions, instead of the O(N) of checking each against the whole sequence (for each new user).

You could make a HashSet<T> of your user IDs in advance. This will cause the Contains to become an O(1) operation:
var oldSet = new HashSet<int>(oldList.Select(o => o.UserID));
var newUsers = NewList.Where(n => !oldSet.Contains(n.UserID)).ToList();

While those HashSet<T> answers are straightforward and simple, some may prefer a linq-centric solution.
LinqToObjects implements join and GroupJoin with a HashSet. Just use one of those - this example uses GroupJoin:
List<User> newUsers =
(
from n in NewList
join o in oldList on n.UserId equals o.UserId into oldGroup
where !oldGroup.Any()
select n
).ToList()

Related

LINQ: Improving performance of "query to find all dictionaries from list of dictionaries where given key has at least one value from list of values"

I tried searching for existing questions, but I could not find anything, so apologize if this is duplicate question.
I have following piece of code. This code runs in a loop for different values of key and listOfValues (listOfDict does not change and built only once, key and listOfValues vary for each iteration). This code currently works, but profiler shows that 50% of the execution time is spent in this LINQ query. Can I improve performance - using different LINQ construct perhaps?
// List of dictionary that allows multiple values against one key.
List<Dictionary<string, List<string>>> listOfDict = BuildListOfDict();
// Following code & LINQ query runs in a loop.
List<string> listOfValues = BuildListOfValues();
string key = GetKey();
// LINQ query to find all dictionaries from listOfDict
// where given key has at least one value from listOfValues.
List<Dictionary<string, List<string>>> result = listOfDict
.Where(dict => dict[key]
.Any(lhs => listOfValues.Any(rhs => lhs == rhs)))
.ToList();
Using HashSet will perform significantly better. You can create a HashSet<string> like so:
IEnumerable<string> strings = ...;
var hashSet = new HashSet<string>(strings);
I assume you can change your methods to return HashSets and make them run like this:
List<Dictionary<string, HashSet<string>>> listOfDict = BuildListOfDict();
HashSet<string> listOfValues = BuildListOfValues();
string key = GetKey();
List<Dictionary<string, HashSet<string>>> result = listOfDict
.Where(dict => listOfValues.Overlaps(dict[key]))
.ToList();
Here HashSet's instance method Overlaps is used. HashSet is optimized for set operations like this. In a test using one dictionary of 200 elements this runs in 3% of the time compared to your method.
UPDATED: Per #GertArnold, switched from Any/Contains to HashSet.Overlaps for slight performance improvement.
Depending on whether listOfValues or the average value for a key is longer, you can either convert listOfValues to a HashSet<string> or build your list of dictionaries to have a HashSet<string> for each value:
// optimize testing against listOfValues
var valHS = listOfValues.ToHashSet();
var result2 = listOfDict.Where(dict => valHS.Overlaps(dict[key]))
.ToList();
// change structure to optimize query
var listOfDict2 = listOfDict.Select(dict => dict.ToDictionary(kvp => kvp.Key, kvp => kvp.Value.ToHashSet())).ToList();
var result3 = listOfDict2.Where(dict => dict[key].Overlaps(listOfValues))
.ToList();
Note: if the query is repeated with differing listOfValues, it probably makes more sense to build the HashSet in the dictionaries once, rather than computing a HashSet from each listOfValues.
#LasseVågsætherKarlsen suggestion in comments to invert the structure intrigued me, so with a further refinement to handle the multiple keys, I created an index structure and tested lookups. With my Test Harness, this is about twice as fast as using a HashSet for one of the List<string>s and four times faster than the original method:
var listOfKeys = listOfDict.First().Select(d => d.Key);
var lookup = listOfKeys.ToDictionary(k => k, k => listOfDict.SelectMany(d => d[k].Select(v => (v, d))).ToLookup(vd => vd.v, vd => vd.d));
Now to filter for a particular key and list of values:
var result4 = listOfValues.SelectMany(v => lookup[key][v]).Distinct().ToList();

Finding strings that are not in DB already

I have some bad performance issues in my application. One of the big operations is comparing strings.
I download a list of strings, approximately 1000 - 10000. These are all unique strings.
Then I need to check if these strings already exists in the database.
The linq query that I'm using looks like this:
IEnumerable<string> allNewStrings = DownloadAllStrings();
var selection = from a in allNewStrings
where !(from o in context.Items
select o.TheUniqueString).Contains(a)
select a;
Am I doing something wrong or how could I make this process faster preferably with Linq?
Thanks.
You did query the same unique strings 1000 - 10000 times for every element in allNewStrings, so it's extremely inefficient.
Try to query unique strings separately in order that it is executed once:
IEnumerable<string> allNewStrings = DownloadAllStrings();
var uniqueStrings = from o in context.Items
select o.TheUniqueString;
var selection = from a in allNewStrings
where !uniqueStrings.Contains(a)
select a;
Now you can see that the last query could be written using Except which is more efficient for the case of set operators like your example:
var selection = allNewStrings.Except(uniqueStrings);
An alternative solution would be to use a HashSet:
var set = new HashSet<string>(DownloadAllStrings());
set.ExceptWith(context.Items.Select(s => s.TheUniqueString));
The set will now contain the the strings that are not in the DB.

EF - Linq Expression and using a List of Ints to get best performance

So I have a list(table) of about 100k items and I want to retrieve all values that match a given list.
I have something like this.
the Table Sections key is NOT a primary key, so I'm expecting each value in listOfKeys to return a few rows.
List<int> listOfKeys = new List<int>(){1,3,44};
var allSections = Sections.Where(s => listOfKeys.Contains(s.id));
I don't know if it makes a difference but generally listOfKeys will only have between 1 to 3 items.
I'm using the Entity Framework.
So my question is, is this the best / fastest way to include a list in a linq expression?
I'm assuming that it isn't better to use another .NETICollection data object. Should I be using a Union or something?
Thanks
Suppose the listOfKeys will contain only small about of items and it's local list (not from database), like <50, then it's OK. The query generated will be basically WHERE id in (...) or WHERE id = ... OR id = ... ... and that's OK for database engine to handle it.
A Join would probably be more efficient:
var allSections =
from s in Sections
join k in listOfKeys on s.id equals k
select s;
Or, if you prefer the extension method syntax:
var allSections = Sections.Join(listOfKeys, s => s.id, k => k, (s, k) => s);

minimum value in dictionary using linq

I have a dictionary of type
Dictionary<DateTime,double> dictionary
How can I retrive a minimum value and key coresponding to this value from this dictionary using linq ?
var min = dictionary.OrderBy(kvp => kvp.Value).First();
var minKey = min.Key;
var minValue = min.Value;
This is not very efficient though; you might want to consider MoreLinq's MinBy extension method.
If you are performing this query very often, you might want to consider a different data-structure.
Aggregate
var minPair = dictionary.Aggregate((p1, p2) => (p1.Value < p2.Value) ? p1 : p2);
Using the mighty Aggregate method.
I know that MinBy is cleaner in this case, but with Aggregate you have more power and its built-in. ;)
Dictionary<DateTime, double> dictionary;
//...
double min = dictionary.Min(x => x.Value);
var minMatchingKVPs = dictionary.Where(x => x.Value == min);
You could combine it of course if you really felt like doing it on one line, but I think the above is easier to read.
var minMatchingKVPs = dictionary.Where(x => x.Value == dictionary.Min(y => y.Value));
You can't easily do this efficiently in normal LINQ - you can get the minimal value easily, but finding the key requires another scan through. If you can afford that, use Jess's answer.
However, you might want to have a look at MinBy in MoreLINQ which would let you write:
var pair = dictionary.MinBy(x => x.Value);
You'd then have the pair with both the key and the value in, after just a single scan.
EDIT: As Nappy says, MinBy is also in System.Interactive in Reactive Extensions.

Combining 2 Linq queries into 1

Given the following information, how can I combine these 2 linq queries into 1. Having a bit of trouble with the join statement.
'projectDetails' is just a list of ProjectDetails
ProjectDetails (1 to many) PCardAuthorizations
ProjectDetails (1 to many) ExpenditureDetails
Notice I am grouping by the same information and selecting the same type of information
var pCardAccount = from c in PCardAuthorizations
where projectDetails.Contains(c.ProjectDetail)
&& c.RequestStatusId == 2
group c by new { c.ProjectDetail, c.ProgramFund } into g
select new { Key = g.Key, Sum = g.Sum(x => x.Amount) };
var expenditures = from d in ExpenditureDetails
where projectDetails.Contains(d.ProjectDetails)
&& d.Expenditures.ExpenditureTypeEnum == 0
group d by new { d.ProjectDetails, d.ProgramFunds } into g
select new {
Key = g.Key,
Sum = g.Sum(y => y.ExpenditureAmounts.FirstOrDefault(a => a.IsCurrent && !a.RequiresAudit).CommittedMonthlyRecords.ProjectedEac)
};
I don't think Mike's suggestion will work here. These two queries are sufficiently different that combining them into one query will just make it more difficult to read.
The objects have different types.
The where clauses are different.
The group by clause is different:
new { c.ProjectDetail, c.ProgramFund }
new { c.ProjectDetails, c.ProgramFunds }
The select clauses are different.
In fact there isn't really anything that is the same. I'd recommend leaving it as it is.
You could always just concat them together before evaluating them.
pCardAccount.Concat(expenditures).ToArray()
should generate a single sql statement with a union. As to whether there is any benefit to this in your situation, I don't know. There's also a chance that linq-to-sql won't be able to generate the SQL for the concat and will throw an exception whenever you use it. I'm not sure what causes it, but I've seen it a few times when I was playing around in a similar situation.
Edit: I just noticed that your keys are different in each of the queries. I'm not sure if this is a typo or not, but if it isn't, you won't be able to concat them due to the different types and it wouldn't make any sense to anyway

Resources