LINQ: Improving performance of "query to find all dictionaries from list of dictionaries where given key has at least one value from list of values"

I tried searching for existing questions, but I could not find anything, so apologies if this is a duplicate question.
I have the following piece of code. It runs in a loop for different values of key and listOfValues (listOfDict does not change and is built only once; key and listOfValues vary for each iteration). The code currently works, but the profiler shows that 50% of the execution time is spent in this LINQ query. Can I improve performance, perhaps by using a different LINQ construct?
// List of dictionary that allows multiple values against one key.
List<Dictionary<string, List<string>>> listOfDict = BuildListOfDict();
// Following code & LINQ query runs in a loop.
List<string> listOfValues = BuildListOfValues();
string key = GetKey();
// LINQ query to find all dictionaries from listOfDict
// where given key has at least one value from listOfValues.
List<Dictionary<string, List<string>>> result = listOfDict
.Where(dict => dict[key]
.Any(lhs => listOfValues.Any(rhs => lhs == rhs)))
.ToList();

Using HashSet will perform significantly better. You can create a HashSet<string> like so:
IEnumerable<string> strings = ...;
var hashSet = new HashSet<string>(strings);
I assume you can change your methods to return HashSets and use them like this:
List<Dictionary<string, HashSet<string>>> listOfDict = BuildListOfDict();
HashSet<string> listOfValues = BuildListOfValues();
string key = GetKey();
List<Dictionary<string, HashSet<string>>> result = listOfDict
.Where(dict => listOfValues.Overlaps(dict[key]))
.ToList();
Here HashSet's instance method Overlaps is used; HashSet is optimized for set operations like this. In a test using one dictionary of 200 elements, this runs in 3% of the time of your method.
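If you want to sanity-check that claim on your own data, a rough Stopwatch comparison along these lines should do. This sketch is mine, not the answerer's test harness, and assumes listOfDict, key and listOfValues are already built as in the question:
using System.Diagnostics;
// Original approach: nested Any over the value lists.
var sw = Stopwatch.StartNew();
var slow = listOfDict
    .Where(dict => dict[key].Any(lhs => listOfValues.Any(rhs => lhs == rhs)))
    .ToList();
sw.Stop();
Console.WriteLine($"Any/Any:  {sw.Elapsed}");
// HashSet approach: build the set once, then use Overlaps.
var valueSet = new HashSet<string>(listOfValues);
sw.Restart();
var fast = listOfDict
    .Where(dict => valueSet.Overlaps(dict[key]))
    .ToList();
sw.Stop();
Console.WriteLine($"Overlaps: {sw.Elapsed}");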

UPDATED: Per @GertArnold, switched from Any/Contains to HashSet.Overlaps for a slight performance improvement.
Depending on whether listOfValues or the average value list for a key is longer, you can either convert listOfValues to a HashSet<string> or build your list of dictionaries with a HashSet<string> for each value:
// optimize testing against listOfValues
var valHS = listOfValues.ToHashSet();
var result2 = listOfDict.Where(dict => valHS.Overlaps(dict[key]))
.ToList();
// change structure to optimize query
var listOfDict2 = listOfDict.Select(dict => dict.ToDictionary(kvp => kvp.Key, kvp => kvp.Value.ToHashSet())).ToList();
var result3 = listOfDict2.Where(dict => dict[key].Overlaps(listOfValues))
.ToList();
Note: if the query is repeated with differing listOfValues, it probably makes more sense to build the HashSet in the dictionaries once, rather than computing a HashSet from each listOfValues.
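As a sketch of what that might look like (assuming the loop has roughly the shape the question describes, with listOfDict2 from above built once, outside the loop):
// listOfDict2 (with HashSet<string> values) is built once, before the loop.
// Each iteration then only varies key and listOfValues:
string key = GetKey();
List<string> listOfValues = BuildListOfValues();
var result = listOfDict2
    .Where(dict => dict[key].Overlaps(listOfValues))
    .ToList();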

@LasseVågsætherKarlsen's suggestion in the comments to invert the structure intrigued me, so with a further refinement to handle the multiple keys, I created an index structure and tested lookups. With my test harness, this is about twice as fast as using a HashSet for one of the List<string>s and four times faster than the original method:
var listOfKeys = listOfDict.First().Select(d => d.Key);
var lookup = listOfKeys.ToDictionary(
    k => k,
    k => listOfDict.SelectMany(d => d[k].Select(v => (v, d)))
                   .ToLookup(vd => vd.v, vd => vd.d));
Now to filter for a particular key and list of values:
var result4 = listOfValues.SelectMany(v => lookup[key][v]).Distinct().ToList();
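One detail worth noting about this index (my observation, not part of the original answer): an ILookup returns an empty sequence for values it has never seen, so listOfValues may safely contain values that occur in no dictionary:
// The ILookup indexer never throws for a missing value; it yields an empty sequence,
// so stray entries in listOfValues simply contribute nothing to the result.
var hits = lookup[key]["some-value-that-occurs-nowhere"];   // hypothetical value
Console.WriteLine(hits.Any());                              // False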

Related

how to get the linq list having Ids from IEnumerable<Object>

In the code below, userModel.Carriers is of type IEnumerable<CarrierModel>.
userModel.Carriers is a list of carriers with Ids. With those Ids, I want to get CarrierDivision using LINQ. But I can't get it right, because the LINQ SQL expression is not compatible with an IEnumerable expression.
userModel.Carriers = carriersForRegion.Select(carrier => Mapper.Map<CarrierModel>(carrier))
.ToList();
var carrierDivision = from c in db.CarrierDivision where c.Contains();
collection.Contains will generate a ... WHERE CarrierId IN (1, 2, 3) SQL query:
var carrierIds = userModel.Carriers.Select(carrier => carrier.Id).ToArray();
var divisions = db.CarrierDivision
.Where(division => carrierIds.Contains(division.CarrierId))
.ToArray();
In case db.CarrierDivision returns an in-memory IEnumerable (not a database query), I would suggest creating a HashSet of carrier ids.
var carrierIds = userModel.Carriers.Select(carrier => carrier.Id).ToHashSet();
var divisions = db.CarrierDivision
.Where(division => carrierIds.Contains(division.CarrierId))
.ToArray();
With a HashSet, the lookup executes without extra enumerations - O(1).

dynamic asc desc sort

I am trying to create table headers that sort during a back-end call in NHibernate. When a header is clicked, it sends a string indicating what to sort by (e.g. "Name", "NameDesc"), which is passed to the db call.
The db can get quite large, so I also have back-end filters and pagination built in to reduce the size of the retrieved data; therefore the order-by needs to happen before or at the same time as the filters and the skip/take, to avoid ordering only the already-reduced data. Here is an example of the QueryOver call:
IList<Event> s =
    session.QueryOver<Event>(() => @eventAlias)
        .Fetch(@event => @event.FiscalYear).Eager
        .JoinQueryOver(() => @eventAlias.FiscalYear, () => fyAlias, JoinType.InnerJoin, Restrictions.On(() => fyAlias.Id).IsIn(_years))
        .Where(() => !@eventAlias.IsDeleted)
        .OrderBy(() => fyAlias.RefCode).Asc
        .ThenBy(() => @eventAlias.Name).Asc
        .Skip(numberOfRecordsToSkip)
        .Take(numberOfRecordsInPage)
        .List();
How can I accomplish this?
One way to achieve this (one of many, because you could also use a fully-typed filter object or some query builder) could be a draft like this:
Part one and two:
// I. a reference to our query
var query = session.QueryOver<Event>(() => @eventAlias);
// II. join, filter... whatever is needed
query
    .Fetch(@event => @event.FiscalYear).Eager
var joinQuery = query
    .JoinQueryOver(...)
    .Where(() => !@eventAlias.IsDeleted)
    ...
Part three:
// III. Order BY
// Assume we have a list of strings (passed from a UI client)
// here represented by these two values
var sortBy = new List<string> {"Name", "CodeDesc"};
// first, have a reference for the OrderBuilder
IQueryOverOrderBuilder<Event, Event> order = null;
// iterate the list
foreach (var sortProperty in sortBy)
{
    // use Desc or Asc?
    var useDesc = sortProperty.EndsWith("Desc");
    // clean the property name
    var name = useDesc
        ? sortProperty.Remove(sortProperty.Length - 4, 4)
        : sortProperty;
    // build the ORDER
    order = order == null
        ? query.OrderBy(Projections.Property(name))
        : query.ThenBy(Projections.Property(name));
    // use DESC or ASC
    query = useDesc ? order.Desc : order.Asc;
}
Finally the results:
// IV. back to query... call the DB and get the result
IList<Event> s = query
.List<Event>();
This draft is ready to do sorting on top of the root query. You can also extend it to add some order statements to joinQuery (e.g. if the string is "FiscalYear.MonthDesc"). The logic would be similar, but built around the joinQuery (see part one and two above).
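A rough, untested sketch of that extension follows; the "FiscalYear." prefix convention and the routing are assumptions for illustration, the property-path resolution against the joined alias may need adjusting to your mappings, and the ThenBy handling from part three is omitted for brevity:
// e.g. sortProperty == "FiscalYear.MonthDesc"
var useDesc = sortProperty.EndsWith("Desc");
var name = useDesc
    ? sortProperty.Remove(sortProperty.Length - 4, 4)
    : sortProperty;
if (name.StartsWith("FiscalYear."))
{
    // order via the joined FiscalYear query (joinQuery from part one and two)
    var joinOrder = joinQuery.OrderBy(Projections.Property(name.Substring("FiscalYear.".Length)));
    joinQuery = useDesc ? joinOrder.Desc : joinOrder.Asc;
}
else
{
    // order on the root query, exactly as in part three
    var rootOrder = query.OrderBy(Projections.Property(name));
    query = useDesc ? rootOrder.Desc : rootOrder.Asc;
}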

How to optimize this linq to objects query?

I'm matching some in-memory lists of entities with a .Contains (subselect) query to filter out old users from new ones.
Checking for performance problems, I saw this:
The oldList mostly has around 1000 users in it, while the new list varies from 100 to 500. Is there a way to optimize this query?
Absolutely - build a set instead of checking a list each time:
// Change string to whatever the type of UserID is.
var oldUserSet = new HashSet<string>(oldList.Select(o => o.UserID));
var newUsers = NewList.Where(n => !oldUserSet.Contains(n.UserID))
.ToList();
The containment check on a HashSet should be O(1) (assuming few hash collisions), instead of the O(N) cost of checking each new user against the whole old sequence.
You could make a HashSet<T> of your user IDs in advance. This will cause the Contains to become an O(1) operation:
var oldSet = new HashSet<int>(oldList.Select(o => o.UserID));
var newUsers = NewList.Where(n => !oldSet.Contains(n.UserID)).ToList();
While those HashSet<T> answers are straightforward and simple, some may prefer a LINQ-centric solution.
LINQ to Objects implements Join and GroupJoin with a hash-based lookup internally. Just use one of those - this example uses GroupJoin:
List<User> newUsers =
    (
        from n in NewList
        join o in oldList on n.UserId equals o.UserId into oldGroup
        where !oldGroup.Any()
        select n
    ).ToList();

minimum value in dictionary using linq

I have a dictionary of type
Dictionary<DateTime,double> dictionary
How can I retrieve the minimum value, and the key corresponding to that value, from this dictionary using LINQ?
var min = dictionary.OrderBy(kvp => kvp.Value).First();
var minKey = min.Key;
var minValue = min.Value;
This is not very efficient though; you might want to consider MoreLinq's MinBy extension method.
If you are performing this query very often, you might want to consider a different data-structure.
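For example, one possible alternative structure (a sketch of my own, not from the answer) is to keep a second index keyed by value, so the minimum is always the first entry:
// Keep the entries in a SortedDictionary keyed by value; the first entry holds the minimum.
// Maintenance costs O(log n) per insert instead of an O(n) scan per minimum query.
var byValue = new SortedDictionary<double, List<DateTime>>();
foreach (var kvp in dictionary)
{
    if (!byValue.TryGetValue(kvp.Value, out var dates))
        byValue[kvp.Value] = dates = new List<DateTime>();
    dates.Add(kvp.Key);
}
var minEntry = byValue.First();            // smallest value
double minValue = minEntry.Key;
List<DateTime> minKeys = minEntry.Value;   // all DateTime keys holding that value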
Aggregate
var minPair = dictionary.Aggregate((p1, p2) => (p1.Value < p2.Value) ? p1 : p2);
Using the mighty Aggregate method.
I know that MinBy is cleaner in this case, but with Aggregate you have more power and it's built in. ;)
Dictionary<DateTime, double> dictionary;
//...
double min = dictionary.Min(x => x.Value);
var minMatchingKVPs = dictionary.Where(x => x.Value == min);
You could combine it of course if you really felt like doing it on one line, but I think the above is easier to read.
var minMatchingKVPs = dictionary.Where(x => x.Value == dictionary.Min(y => y.Value));
You can't easily do this efficiently in normal LINQ - you can get the minimal value easily, but finding the key requires another scan through. If you can afford that, use Jess's answer.
However, you might want to have a look at MinBy in MoreLINQ which would let you write:
var pair = dictionary.MinBy(x => x.Value);
You'd then have the pair with both the key and the value in, after just a single scan.
EDIT: As Nappy says, MinBy is also in System.Interactive in Reactive Extensions.

Why is this LINQ so slow?

Can anyone please explain why the third query below is orders of magnitude slower than the others when it oughtn't to take any longer than doing the first two in sequence?
var data = Enumerable.Range(0, 10000).Select(x => new { Index = x, Value = x + " is the magic number"}).ToList();
var test1 = data.Select(x => new { Original = x, Match = data.Single(y => y.Value == x.Value) }).Take(1).Dump();
var test2 = data.Select(x => new { Original = x, Match = data.Single(z => z.Index == x.Index) }).Take(1).Dump();
var test3 = data.Select(x => new { Original = x, Match = data.Single(z => z.Index == data.Single(y => y.Value == x.Value).Index) }).Take(1).Dump();
EDIT: I've added a .ToList() to the original data generation because I don't want any repeated generation of the data clouding the issue.
I'm just trying to understand why this code is so slow, by the way, not looking for a faster alternative, unless it sheds some light on the matter. I would have thought that if LINQ is lazily evaluated and I'm only looking for the first item (Take(1)), then test3's:
data.Select(x => new { Original = x, Match = data.Single(z => z.Index == data.Single(y => y.Value == x.Value).Index) }).Take(1);
could reduce to:
data.Select(x => new { Original = x, Match = data.Single(z => z.Index == 1) }).Take(1)
in O(N) as the first item in data is successfully matched after one full scan of the data by the inner Single(), leaving one more sweep of the data by the remaining Single(). So still all O(N).
It's evidently being processed in a more long-winded way, but I don't really understand how or why.
Test3 takes a couple of seconds to run by the way, so I think we can safely assume that if your answer features the number 10^16 you've made a mistake somewhere along the line.
The first two "tests" are identical, and both slow. The third adds another entire level of slowness.
The first two LINQ statements here are quadratic in nature. Since your "Match" element potentially requires iterating through the entire "data" sequence in order to find the match, as you progress through the range, the length of time for that element will get progressively longer. The 10000th element, for example, will force the engine to iterate through all 10000 elements of the original sequence to find the match, making this an O(N^2) operation.
The "test3" operation takes this to an entirely new level of pain, since it's "squaring" the O(N^2) operation in the second single - forcing it to do another quadratic operation on top of the first one - which is going to be a huge number of operations.
Each time you do data.Single(...) with the match, you're doing an O(N^2) operation - the third test basically becomes O(N^4), which will be orders of magnitude slower.
Fixed.
var data = Enumerable.Range(0, 10000)
.Select(x => new { Index = x, Value = x + " is the magic number"})
.ToList();
var forward = data.ToLookup(x => x.Index);
var backward = data.ToLookup(x => x.Value);
var test1 = data.Select(x => new { Original = x,
                                   Match = backward[x.Value].Single() })
                .Take(1).Dump();
var test2 = data.Select(x => new { Original = x,
                                   Match = forward[x.Index].Single() })
                .Take(1).Dump();
var test3 = data.Select(x => new { Original = x,
                                   Match = forward[backward[x.Value].Single().Index].Single() })
                .Take(1).Dump();
In the original code,
data.ToList() enumerates 10,000 elements (10^4).
data.Select( data.Single() ).ToList() performs about 100,000,000 element enumerations (10^8), since each Single scans the whole list.
data.Select( data.Single( data.Single() ) ).ToList() performs about 1,000,000,000,000 element enumerations (10^12), since the inner Single runs once for every element the outer Single examines.
Single and First are different. Single throws if multiple instances are encountered. Single must fully enumerate its source to check for multiple instances.
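A small illustration of that last point (my example, not from the answer):
var items = Enumerable.Range(0, 10000);
var first = items.First(i => i == 0);     // can stop after the very first element
var single = items.Single(i => i == 0);   // must scan all 10,000 elements to prove there is no second match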
