ES6 spread elements vs nested forEach performance

I have the following code:
const modules = [{courses: [...]}, {courses: [...]}, ...];
let deleteCourses = [];
modules.forEach((mod) => {
  mod.courses.forEach((course) => {
    deleteCourses.push(course);
  });
  // versus
  deleteCourses = [...deleteCourses, ...mod.courses];
});
Assuming modules and courses each have between 30 and 100 elements, I was wondering which of these is more efficient.
On the one hand, I was taught to avoid nesting forEach loops. On the other hand, the array literal creates a new Array instance on every iteration.
Thanks!

It would seem that nested forEach is very much faster, as this jsPerf shows:
Setup:
const modules = Array(30).fill({courses:Array(30).fill(1)}) //30x30 elements
let deleteCourses = [];
Case 1: nested forEach - 29,293 ops/sec
modules.forEach((mod) => {
  mod.courses.forEach((course) => {
    deleteCourses.push(course);
  })
})
Case 2: ES6 Spread Operator - 49.13 ops/sec
modules.forEach((mod) => {
  deleteCourses = [...deleteCourses, ...mod.courses];
})
That's about 600x faster for that 30x30 sample.
This makes sense when you consider the amount of redundancy in respreading deleteCourses on every iteration. With the spread version, the number of element copies performed on each iteration is roughly the current length of deleteCourses, and that length grows on every iteration. With nested forEach, each course is pushed exactly once.
So the difference in performance has little to do with the creation of new Array instances, and much to do with the multitude of redundant copies created by this misuse of the spread operator.
To be sure, looking at the two cases individually, we can see that:
the ES6 spread approach is quadratic in the number of modules: O(n²)
the nested forEach approach is linear: O(n)
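To make the quadratic growth concrete, here is a back-of-the-envelope count of element copies, assuming for simplicity that there are n modules with m courses each (n = m = 30 in the benchmark above):

\text{spread copies} = \sum_{k=1}^{n} k\,m = m\,\frac{n(n+1)}{2} = O(n^2 m)
\text{forEach pushes} = n\,m = O(nm)

For n = m = 30 that is 13,950 element copies (plus 30 intermediate array allocations) versus 900 pushes.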

Related

LINQ: Improving performance of "query to find all dictionaries from list of dictionaries where given key has at least one value from list of values"

I tried searching for existing questions but could not find anything, so I apologize if this is a duplicate question.
I have the following piece of code. It runs in a loop for different values of key and listOfValues (listOfDict does not change and is built only once; key and listOfValues vary on each iteration). The code currently works, but the profiler shows that 50% of the execution time is spent in this LINQ query. Can I improve its performance, perhaps by using a different LINQ construct?
// List of dictionaries, each of which allows multiple values against one key.
List<Dictionary<string, List<string>>> listOfDict = BuildListOfDict();

// The following code & LINQ query run in a loop.
List<string> listOfValues = BuildListOfValues();
string key = GetKey();

// LINQ query to find all dictionaries from listOfDict
// where the given key has at least one value from listOfValues.
List<Dictionary<string, List<string>>> result = listOfDict
    .Where(dict => dict[key]
        .Any(lhs => listOfValues.Any(rhs => lhs == rhs)))
    .ToList();
Using HashSet will perform significantly better. You can create a HashSet<string> like so:
IEnumerable<string> strings = ...;
var hashSet = new HashSet<string>(strings);
I assume you can change your methods to return HashSets and make them run like this:
List<Dictionary<string, HashSet<string>>> listOfDict = BuildListOfDict();
HashSet<string> listOfValues = BuildListOfValues();
string key = GetKey();
List<Dictionary<string, HashSet<string>>> result = listOfDict
    .Where(dict => listOfValues.Overlaps(dict[key]))
    .ToList();
Here HashSet's instance method Overlaps is used. HashSet is optimized for set operations like this. In a test using one dictionary of 200 elements this runs in 3% of the time compared to your method.
UPDATED: Per @GertArnold, switched from Any/Contains to HashSet.Overlaps for a slight performance improvement.
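For reference, the pre-update version tested membership with Contains rather than Overlaps; it presumably looked roughly like this (a sketch of the earlier form, not the exact original):

// earlier Any/Contains form mentioned in the update (sketch)
List<Dictionary<string, HashSet<string>>> result = listOfDict
    .Where(dict => dict[key].Any(v => listOfValues.Contains(v)))
    .ToList();

Either way, the win over the original query comes from HashSet<string> lookups being O(1), instead of the nested Any/Any scan that compares every value under dict[key] against every entry of listOfValues.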
Depending on whether listOfValues or the average list of values for a key is longer, you can either convert listOfValues to a HashSet<string> or build your list of dictionaries with a HashSet<string> for each value:
// optimize testing against listOfValues
var valHS = listOfValues.ToHashSet();
var result2 = listOfDict
    .Where(dict => valHS.Overlaps(dict[key]))
    .ToList();

// change structure to optimize query
var listOfDict2 = listOfDict
    .Select(dict => dict.ToDictionary(kvp => kvp.Key, kvp => kvp.Value.ToHashSet()))
    .ToList();
var result3 = listOfDict2
    .Where(dict => dict[key].Overlaps(listOfValues))
    .ToList();
Note: if the query is repeated with differing listOfValues, it probably makes more sense to build the HashSet in the dictionaries once, rather than computing a HashSet from each listOfValues.
@LasseVågsætherKarlsen's suggestion in the comments to invert the structure intrigued me, so with a further refinement to handle the multiple keys, I created an index structure and tested lookups. With my test harness, this is about twice as fast as using a HashSet for one of the List<string>s and four times faster than the original method:
var listOfKeys = listOfDict.First().Select(d => d.Key);
// for each key, build a lookup from value to the dictionaries containing that value
var lookup = listOfKeys.ToDictionary(
    k => k,
    k => listOfDict.SelectMany(d => d[k].Select(v => (v, d)))
                   .ToLookup(vd => vd.v, vd => vd.d));
Now to filter for a particular key and list of values:
var result4 = listOfValues.SelectMany(v => lookup[key][v]).Distinct().ToList();

Resolve array of observables and append in final array

I have an endpoint URL like http://site/api/myquery?start=&limit= which returns an array of strings.
If I call the endpoint in this way, the server hangs, since the returned array of strings is huge.
I need to generate an array of observables with incremental "start" and "limit" parameters, resolve all of them either in sequence or in parallel, and then get a final observable that ultimately yields the true array of strings, obtained by merging all the sub-arrays returned by the inner observables.
How should I do that?
i.e. the array of observables would be something like
[
  httpClient.get('http://site/api/myquery?start=0&limit=1000'),
  httpClient.get('http://site/api/myquery?start=1000&limit=1000'),
  httpClient.get('http://site/api/myquery?start=2000&limit=1000'),
  ....
]
If you know the total length before making all these queries, then you can create as many http-get Observables as you need and forkJoin them using a projection function.
forkJoin will let you make the queries in parallel and then merge their results. Here's an example:
import { forkJoin } from 'rxjs';
// given we know the length:
const LENGTH = 500;
// we can pick arbitrary page size
const PAGE_SIZE = 50;
// calculate requests count
const requestsCount = Math.ceil(LENGTH / PAGE_SIZE);
// generate calculated number of requests
// generate calculated number of requests
const requests = (new Array(requestsCount))
  .fill(void 0)
  .map((_, i) => {
    const start = i * PAGE_SIZE;
    return http.get(`http://site/api/myquery?start=${start}&limit=${PAGE_SIZE}`);
  });

forkJoin(
  requests,
  // projecting fn
  // merge all arrays into one
  // suboptimal merging, just for example
  (...results) => results.reduce((acc, curr) => [...acc, ...curr], [])
).subscribe(array => {
  console.log(array);
});
Check this forkJoin example for reference.
Hope this helps
In the case that you do not know the total number of items, you can do this using expand.
The following article gives a good introduction to expand and an explanation of how to use it for pagination.
https://ncjamieson.com/understanding-expand/
Something along the lines of the code below would work in your case, making the requests for each page in series.
import { empty } from 'rxjs';
import { expand, reduce } from 'rxjs/operators';

const limit = 1000;
let currentStart = 0;
const getUrl = (start, limit) => `http://site/api/myquery?start=${start}&limit=${limit}`;

httpClient.get(getUrl(currentStart, limit)).pipe(
  expand(itemsArray => {
    // keep paging while the last response was non-empty
    if (itemsArray.length) {
      currentStart += limit;
      return httpClient.get(getUrl(currentStart, limit));
    }
    return empty();
  }),
  // concatenate all page arrays into one
  reduce((acc, value) => [...acc, ...value]),
).subscribe(itemsArray => {
  console.log(itemsArray);
});
This will log out the final array of items once the entire series of requests has been resolved.

Get the biggest collection in a collection of collections

I have a Collection of collections.
I would like to get the biggest collection inside the collection.
I wrote a function that works well, but I'm pretty sure it can be done much quicker:
private function getMaxFightersByEntity($userGroups): int
{
    $max = 0;
    foreach ($userGroups as $userGroup) { // $userGroup is another Collection
        if (count($userGroup) > $max) {
            $max = count($userGroup);
        }
    }

    return $max;
}
I'm quite sure there is a better way of managing collections, but I don't really know it.
Does anyone have a better solution?
You can sort the collection by the count of the inner collections, and then just take the first item (largest group).
// sortByDesc: sort the groups by their size, largest first
// first: get the first item in the result: the largest group
// count: get the size of the largest group
return $userGroups
    ->sortByDesc(function ($group) {
        return $group->count();
    })
    ->first()
    ->count();
It won't be "quicker" than your current solution in execution time, but it is written to take advantage of the functions provided by collections.

What are the benefits of deferred execution in LINQ?

LINQ uses a deferred execution model, which means that the resulting sequence is not produced at the time the LINQ operators are called; instead, these operators return an object that yields the elements of the sequence only when we enumerate it.
While I understand how deferred queries work, I'm having some trouble understanding the benefits of deferred execution:
1) I've read that a deferred query executing only when you actually need the results can be of great benefit. What is this benefit?
2) Another advantage of deferred queries is that if you define a query once, then each time you enumerate it, you will get different results if the data has changed.
a) But as seen in the code below, we're able to achieve the same effect (each time we enumerate the source, we get different results if the data has changed) even without using deferred queries:
List<string> sList = new List<string>(new[] { "A", "B" });

foreach (string item in sList)
    Console.WriteLine(item); // Q1 outputs AB

sList.Add("C");

foreach (string item in sList)
    Console.WriteLine(item); // Q2 outputs ABC
3) Are there any other benefits of deferred execution?
The main benefit is that this allows filtering operations, the core of LINQ, to be much more efficient. (This is effectively your item #1).
For example, take a LINQ query like this:
var results = collection.Select(item => item.Foo).Where(foo => foo < 3).ToList();
With deferred execution, the above iterates your collection one time and, as each item is requested during the iteration, performs the mapping, applies the filter, and then uses the result to build the list.
If you were to make LINQ fully execute each time, each operation (Select / Where) would have to iterate through the entire sequence. This would make chained operations very inefficient.
Personally, I'd say your item #2 above is more of a side effect rather than a benefit - while it's, at times, beneficial, it also causes some confusion at times, so I would just consider this "something to understand" and not tout it as a benefit of LINQ.
In response to your edit:
"In your particular example, in both cases Select would iterate collection and return an IEnumerable I1 of type item.Foo. Where() would then enumerate I1 and return IEnumerable<> I2 of type item.Foo. I2 would then be converted to a List."
This is not true - deferred execution prevents this from occurring.
In my example, the return type is IEnumerable<T>, which means that it's a collection that can be enumerated, but, due to deferred execution, it isn't actually enumerated.
When you call ToList(), the entire collection is enumerated. The result ends up conceptually looking something like this (though the actual implementation is, of course, different):
List<Foo> results = new List<Foo>();
foreach (var item in collection)
{
    // "Select" does a mapping
    var foo = item.Foo;

    // "Where" filters
    if (!(foo < 3))
        continue;

    // "ToList" builds results
    results.Add(foo);
}
Deferred execution causes the sequence itself to only be enumerated (foreach) one time, when it's used (by ToList()). Without deferred execution, it would look more like (conceptually):
// Select
List<Foo> foos = new List<Foo>();
foreach (var item in collection)
{
    foos.Add(item.Foo);
}

// Where
List<Foo> foosFiltered = new List<Foo>();
foreach (var foo in foos)
{
    if (foo < 3)
        foosFiltered.Add(foo);
}

// ToList
List<Foo> results = new List<Foo>();
foreach (var item in foosFiltered)
{
    results.Add(item);
}
Another benefit of deferred execution is that it allows you to work with infinite series. For instance:
public static IEnumerable<ulong> FibonacciNumbers()
{
    yield return 0;
    yield return 1;
    ulong previous = 0, current = 1;
    while (true)
    {
        ulong next = checked(previous + current);
        yield return next;
        previous = current;
        current = next;
    }
}
(Source: http://chrisfulstow.com/fibonacci-numbers-iterator-with-csharp-yield-statements/)
You can then do the following:
var firstTenOddFibNumbers = FibonacciNumbers().Where(n => n % 2 == 1).Take(10);
foreach (var num in firstTenOddFibNumbers)
{
    Console.WriteLine(num);
}
Prints:
1
1
3
5
13
21
55
89
233
377
Without deferred execution, you would get an OverflowException, or, if the addition weren't checked, it would run forever because the value wraps around (and calling ToList on it would eventually cause an OutOfMemoryException).
An important benefit of deferred execution is that you receive up-to-date data. This may be a hit on performance (especially if you are dealing with absurdly large data sets) but equally the data might have changed by the time your original query returns a result. Deferred execution makes sure you will get the latest information from the database in scenarios where the database is updated rapidly.
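The same re-evaluation can be seen with a plain in-memory collection (a minimal sketch; the names here are made up for the example):

using System;
using System.Collections.Generic;
using System.Linq;

class DeferredQueryDemo
{
    static void Main()
    {
        var numbers = new List<int> { 1, 2, 3 };

        // Deferred: defining the query does not enumerate numbers.
        var evens = numbers.Where(n => n % 2 == 0);

        Console.WriteLine(evens.Count()); // 1 (only 2 matches)

        numbers.Add(4);
        Console.WriteLine(evens.Count()); // 2 (the same query now also sees 4)
    }
}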

LINQ efficiency question - foreach vs aggregates

Which is more efficient?
// Option 1
foreach (var q in baseQuery)
{
    m_TotalCashDeposit += q.deposit.Cash;
    m_TotalCheckDeposit += q.deposit.Check;
    m_TotalCashWithdrawal += q.withdraw.Cash;
    m_TotalCheckWithdrawal += q.withdraw.Check;
}

// Option 2
m_TotalCashDeposit = baseQuery.Sum(q => q.deposit.Cash);
m_TotalCheckDeposit = baseQuery.Sum(q => q.deposit.Check);
m_TotalCashWithdrawal = baseQuery.Sum(q => q.withdraw.Cash);
m_TotalCheckWithdrawal = baseQuery.Sum(q => q.withdraw.Check);
I guess what I'm asking is, calling Sum will basically enumerate over the list right? So if I call Sum four times, isn't that enumerating over the list four times? Wouldn't it be more efficient to just do a foreach instead so I only have to enumerate the list once?
It might, and it might not; it depends.
The only sure way to know is to actually measure it.
To do that, use BenchmarkDotNet, here's an example which you can run in LINQPad or a console application:
void Main()
{
    BenchmarkSwitcher.FromAssembly(GetType().Assembly).RunAll();
}

public class Benchmarks
{
    [Benchmark]
    public void Option1()
    {
        // foreach (var q in baseQuery)
        // {
        //     m_TotalCashDeposit += q.deposit.Cash;
        //     m_TotalCheckDeposit += q.deposit.Check;
        //     m_TotalCashWithdrawal += q.withdraw.Cash;
        //     m_TotalCheckWithdrawal += q.withdraw.Check;
        // }
    }

    [Benchmark]
    public void Option2()
    {
        // m_TotalCashDeposit = baseQuery.Sum(q => q.deposit.Cash);
        // m_TotalCheckDeposit = baseQuery.Sum(q => q.deposit.Check);
        // m_TotalCashWithdrawal = baseQuery.Sum(q => q.withdraw.Cash);
        // m_TotalCheckWithdrawal = baseQuery.Sum(q => q.withdraw.Check);
    }
}
BenchmarkDotNet is a powerful library for measuring performance, and is much more accurate than simply using Stopwatch, as it will use statistically correct approaches and methods, and also take such things as JITting and GC into account.
Now that I'm older and wiser I no longer believe using Stopwatch is a good way to measure performance. I won't remove the old answer, as Google and similar links may lead people here looking for how to use Stopwatch to measure performance, but I hope I have added a better approach above.
Original answer below
Simple code to measure it:
Stopwatch sw = new Stopwatch();
sw.Start();
// your code here
sw.Stop();
Debug.WriteLine("Time taken: " + sw.ElapsedMilliseconds + " ms");
sw.Reset(); // in case you have more code below that reuses sw
You should run the code multiple times to avoid JITting having too large an effect on your timings.
I went ahead and profiled this and found that you are correct.
Each Sum() effectively creates its own loop. In my simulation, I had it sum a SQL dataset with 20,319 records, each with 3 summable fields, and found that writing your own loop had a 2X advantage.
I had hoped that LINQ would optimize this away and push the whole burden onto the SQL server, but unless I move the sum requests into the initial LINQ statement, it executes each request one at a time.
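For what that last point could look like, one way to fold the four sums into a single statement, so a LINQ-to-SQL style provider can translate them into one query, is a grouping projection along these lines (a sketch only, reusing the property names from the question and assuming baseQuery is a non-empty IQueryable):

// Group everything into a single bucket so all four aggregates
// are computed together, in one translated query / one pass.
var totals = baseQuery
    .GroupBy(q => 1)
    .Select(g => new
    {
        CashDeposit = g.Sum(q => q.deposit.Cash),
        CheckDeposit = g.Sum(q => q.deposit.Check),
        CashWithdrawal = g.Sum(q => q.withdraw.Cash),
        CheckWithdrawal = g.Sum(q => q.withdraw.Check)
    })
    .Single(); // throws if baseQuery is empty

m_TotalCashDeposit = totals.CashDeposit;
m_TotalCheckDeposit = totals.CheckDeposit;
m_TotalCashWithdrawal = totals.CashWithdrawal;
m_TotalCheckWithdrawal = totals.CheckWithdrawal;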
