LINQ efficiency question - foreach vs aggregates

Which is more efficient?
//Option 1
foreach (var q in baseQuery)
{
    m_TotalCashDeposit += q.deposit.Cash;
    m_TotalCheckDeposit += q.deposit.Check;
    m_TotalCashWithdrawal += q.withdraw.Cash;
    m_TotalCheckWithdrawal += q.withdraw.Check;
}
//Option 2
m_TotalCashDeposit = baseQuery.Sum(q => q.deposit.Cash);
m_TotalCheckDeposit = baseQuery.Sum(q => q.deposit.Check);
m_TotalCashWithdrawal = baseQuery.Sum(q => q.withdraw.Cash);
m_TotalCheckWithdrawal = baseQuery.Sum(q => q.withdraw.Check);
I guess what I'm asking is: calling Sum will basically enumerate over the list, right? So if I call Sum four times, isn't that enumerating over the list four times? Wouldn't it be more efficient to just do a foreach instead, so I only have to enumerate the list once?

It might, and it might not; it depends.
The only sure way to know is to actually measure it.
To do that, use BenchmarkDotNet; here's an example which you can run in LINQPad or a console application:
void Main()
{
    BenchmarkSwitcher.FromAssembly(GetType().Assembly).RunAll();
}

public class Benchmarks
{
    [Benchmark]
    public void Option1()
    {
        // foreach (var q in baseQuery)
        // {
        //     m_TotalCashDeposit += q.deposit.Cash;
        //     m_TotalCheckDeposit += q.deposit.Check;
        //     m_TotalCashWithdrawal += q.withdraw.Cash;
        //     m_TotalCheckWithdrawal += q.withdraw.Check;
        // }
    }

    [Benchmark]
    public void Option2()
    {
        // m_TotalCashDeposit = baseQuery.Sum(q => q.deposit.Cash);
        // m_TotalCheckDeposit = baseQuery.Sum(q => q.deposit.Check);
        // m_TotalCashWithdrawal = baseQuery.Sum(q => q.withdraw.Cash);
        // m_TotalCheckWithdrawal = baseQuery.Sum(q => q.withdraw.Check);
    }
}
BenchmarkDotNet is a powerful library for measuring performance, and is much more accurate than simply using Stopwatch: it uses statistically sound approaches and methods, and also takes things such as JITting and GC into account.
Now that I'm older and wiser, I no longer believe using Stopwatch is a good way to measure performance. I won't remove the old answer, as Google and similar links may lead people here looking for how to use Stopwatch to measure performance, but I hope I have added a better approach above.
Original answer below
Simple code to measure it:
Stopwatch sw = new Stopwatch();
sw.Start();
// your code here
sw.Stop();
Debug.WriteLine("Time taken: " + sw.ElapsedMilliseconds + " ms");
sw.Reset(); // in case you have more code below that reuses sw
You should run the code multiple times to avoid JITting having too large an effect on your timings.
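If you do stick with the Stopwatch approach, a rough sketch of that idea looks like this (the doWork delegate stands in for whatever code you want to time, and the run count is arbitrary):
using System;
using System.Diagnostics;

static class StopwatchTiming
{
    static void Measure(Action doWork, int runs = 10)
    {
        doWork(); // warm-up run so JIT compilation doesn't skew the first timing

        var sw = new Stopwatch();
        long totalMs = 0;
        for (int i = 0; i < runs; i++)
        {
            sw.Restart();
            doWork();
            sw.Stop();
            totalMs += sw.ElapsedMilliseconds;
        }
        Console.WriteLine("Average over {0} runs: {1} ms", runs, totalMs / runs);
    }
}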

I went ahead and profiled this and found that you are correct.
Each Sum() effectively creates its own loop. In my simulation, I had it sum a SQL dataset with 20,319 records, each with 3 summable fields, and found that writing your own loop had a 2x advantage.
I had hoped that LINQ would optimize this away and push the whole burden onto the SQL server, but unless I move the sum request into the initial LINQ statement, it executes each request one at a time.
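For illustration, one way to "move the sums into the initial statement" is to fold all four aggregates into a single projection. This is only a sketch using the names from the question; whether a provider such as LINQ to SQL or EF turns it into one SQL statement depends on the provider, and for plain LINQ to Objects the source is still only enumerated once (via the grouping):
// Group everything into a single bucket and compute all four sums in one query.
// Note: Single() throws if baseQuery is empty; use SingleOrDefault() plus a null
// check if that can happen.
var totals = baseQuery
    .GroupBy(q => 1)
    .Select(g => new
    {
        CashDeposit     = g.Sum(q => q.deposit.Cash),
        CheckDeposit    = g.Sum(q => q.deposit.Check),
        CashWithdrawal  = g.Sum(q => q.withdraw.Cash),
        CheckWithdrawal = g.Sum(q => q.withdraw.Check)
    })
    .Single();

m_TotalCashDeposit     = totals.CashDeposit;
m_TotalCheckDeposit    = totals.CheckDeposit;
m_TotalCashWithdrawal  = totals.CashWithdrawal;
m_TotalCheckWithdrawal = totals.CheckWithdrawal;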

Related

Linq Select into New Object Performance

I am new to Linq, using C#. I got a big surprise when I executed the following:
var scores = objects.Select( i => new { obj = i,
                                        score1 = i.algorithm1(),
                                        score2 = i.algorithm2(),
                                        score3 = i.algorithm3() } );
double avg2 = scores.Average( i => i.score2); // algorithm1() is called for every object
double cutoff2 = avg2 + scores.Select( i => i.score2).StdDev(); // algorithm1() is called for every object
double avg3 = scores.Average( i => i.score3); // algorithm1() is called for every object
double cutoff3 = avg3 + scores.Select( i => i.score3).StdDev(); // algorithm1() is called for every object
foreach( var s in scores.Where( i => i.score2 > cutoff2 | i.score3 > cutoff3 ).OrderBy( i => i.score1 )) // algorithm1() is called for every object
{
    Debug.Log(String.Format("{0} {1} {2} {3}\n", s.obj, s.score1, s.score2/avg2, s.score3/avg3));
}
The attributes in my new objects store the function calls rather than the values. Each time I try to access an attribute, the original function is called again. I assume this is a huge waste of time? How can I avoid this?
Yes, you've discovered that LINQ uses deferred execution. This is a normal part of LINQ, and very handy indeed for building up queries without actually executing anything until you need to - which in turn is great for pipelines of multiple operations over potentially huge data sources which can be streamed.
For more details about how LINQ to Objects works internally, you might want to read my Edulinq blog series - it's basically a reimplementation of the whole of LINQ to Objects, one method at a time. Hopefully by the end of that you'll have a much clearer idea of what to expect.
If you want to materialize the query, you just need to call ToList or ToArray to build an in-memory copy of the results:
var scores = objects.Select( i => new { obj = i,
                                        score1 = i.algorithm1(),
                                        score2 = i.algorithm2(),
                                        score3 = i.algorithm3() } ).ToList();

LINQ Out of Memory Error

I am querying 200k records and using up all the server's memory (no surprise). I am new to LINQ so I found the following code that should help me but I don't know how to use it:
public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> collection, int batchSize)
{
    List<T> nextbatch = new List<T>(batchSize);
    foreach (T item in collection)
    {
        nextbatch.Add(item);
        if (nextbatch.Count == batchSize)
        {
            yield return nextbatch;
            nextbatch = new List<T>(batchSize);
        }
    }
    if (nextbatch.Count > 0)
        yield return nextbatch;
}
Source: http://goo.gl/aQZIj
Here is my code which creates the "out of memory" error. How do I incorporate the new Batch function into my code?
var crmMetrics = _crmDbContext.tpm_metricsSet.Where(a => a.ModifiedOn >= lastRunDate);
foreach (var crmMetric in crmMetrics)
{
    var metric = new Metric();
    metric.ProductKey = crmMetric.tpm_Product.Id;
    dbContext.Metrics.Add(metric);
    dbContext.SaveChanges();
}
It's an extension method, so if it is part of a static class and there is a reference to the class's namespace in your code, you could do:
var crmMetricsBatches = _crmDbContext.tpm_metricsSet
    .Where(a => a.ModifiedOn >= lastRunDate)
    .AsEnumerable() // !!
    .Batch(20);
Except it wouldn't help. Because of the .AsEnumerable(), you still fetch all the data into memory, just now in chunks of 20. That is because you can't use the method directly against an IQueryable: Entity Framework will try to translate it to SQL but of course has no clue how to do that.
As TGH said, Skip and Take are better suited for this:
var crmMetricsPage = _crmDbContext.tpm_metricsSet
    .Where(a => a.ModifiedOn >= lastRunDate)
    .OrderBy(a => a.??) // some property you choose
    .Skip(pageNo * pageSize)
    .Take(pageSize);
where pageNo counts from 0 up to the number of pages (minus 1) you're going to need. Skip and Take are expressions, and EF knows how to convert them to SQL. The OrderBy is required so that EF knows where to start skipping.
In this process, called paging, you always get pageSize records at a time. The number of queries is greater, but resources are spared. One condition is that you can determine the pageSize in advance; I don't know if that fits with your logic.
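As a rough sketch of the paging loop, under the assumption that ModifiedOn is an acceptable ordering property and that ProcessBatch is a hypothetical stand-in for whatever you do with each batch:
const int pageSize = 500; // tune to taste

// Count once so we know how many pages to fetch.
int totalCount = _crmDbContext.tpm_metricsSet
    .Count(a => a.ModifiedOn >= lastRunDate);
int pageCount = (totalCount + pageSize - 1) / pageSize;

for (int pageNo = 0; pageNo < pageCount; pageNo++)
{
    var page = _crmDbContext.tpm_metricsSet
        .Where(a => a.ModifiedOn >= lastRunDate)
        .OrderBy(a => a.ModifiedOn)   // any stable ordering will do
        .Skip(pageNo * pageSize)
        .Take(pageSize)
        .ToList();                    // one SQL round trip per page

    ProcessBatch(page);               // hypothetical: your per-batch work goes here
}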
If you can't use paging, you should try to narrow the filter (the Where(a => a.ModifiedOn >= lastRunDate)), e.g. try to get the data in batches of one day or one week.
I would use LINQ's Skip and Take to get the batches.
Check this out:
http://www.c-sharpcorner.com/UploadFile/3d39b4/take-and-skip-operator-in-linq-to-sql/

Why is the second LINQ query faster?

In this code:
static bool Spin(int WaitTime)
{
    Console.WriteLine("Running task {0} : thread {1}",
        Task.CurrentId, Thread.CurrentThread.ManagedThreadId);
    Thread.Sleep(WaitTime);
    return true;
}

public void DemoPLINQLong()
{
    var SomeBigNumber = 1000000;
    var sequence = Enumerable.Range(0, SomeBigNumber);

    var sw = new Stopwatch();
    sw.Start();
    sequence.Where(i => Spin(SomeBigNumber));
    sw.Stop();
    var synchTime = sw.Elapsed;

    sw.Restart();
    sequence.Where(i => Spin(SomeBigNumber));
    sw.Stop();
    var asynchTime = sw.Elapsed;

    Console.WriteLine("Synchronous: {0} Asynchronous: {1}",
        synchTime.ToString(), asynchTime.ToString());
}
The results are consistent:
Synchronous: 00:00:00.0021800 Asynchronous: 00:00:00.0000076
Why is the second LINQ query hundreds of times faster? Is there some kind of caching going on? How?
.NET caches and creates performance optimizations the first time code is executed; this is known as Just-In-Time (JIT) compilation. Upon subsequent calls to the same code, the runtime can reuse the existing optimizations, which is why you'll frequently see the first run of nearly anything being much slower than subsequent runs of the same code.
A couple of side notes about the posted code:
Not sure what the "Synchronous" and "Asynchronous" terms are referring to; both examples are the exact same thing and there is nothing Asynchronous about them.
If you're not aware, none of the LINQ is actually being evaluated in the example, due to LINQ's deferred execution. You can see this behavior if you change the example from sequence.Where(i => Spin(SomeBigNumber)) to sequence.Where(i => Spin(SomeBigNumber)).ToList(). Here, ToList() forces evaluation of the LINQ predicate, and the Console.WriteLine in the Spin method will actually be written to the console.
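If you want to see the difference yourself, here is a small sketch of the forced-evaluation version, scaled way down since with the original values each Spin call would sleep for roughly 1,000 seconds (the counts below are arbitrary):
var smallNumber = 100;
var sequence = Enumerable.Range(0, smallNumber);

var sw = Stopwatch.StartNew();
var first = sequence.Where(i => Spin(1)).ToList();   // Spin actually runs for every element now
sw.Stop();
var firstTime = sw.Elapsed;

sw.Restart();
var second = sequence.Where(i => Spin(1)).ToList();  // runs again; the two timings are now comparable
sw.Stop();
var secondTime = sw.Elapsed;

Console.WriteLine("First: {0} Second: {1}", firstTime, secondTime);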

Why is performance so bad on this EF data import method?

I have a loop, abridged here, that performs an import of employee records, as follows:
var cts2005 = new Cts2005Entities();
IEmployeeRepository repository = new EmployeeRepository();
foreach (var c in cts2005.Candidates)
{
    var e = new Employee();
    e.RefNum = c.CA_EMP_ID;
    e.TitleId = GetTitleId(c.TITLE);
    e.Initials = c.CA_INITIALS;
    e.Surname = c.CA_SURNAME;
    repository.Insert(e);
}
There are actually several more fields, and a total of nine lookups like GetTitleId(c.TITLE) above. Code for these is all exactly like this:
private List<Title> _titles;

private Guid GetTitleId(string titleName)
{
    ITitleRepository repository = new TitleRepository();
    if (_titles == null)
    {
        _titles = repository.ListAll().ToList();
    }
    var title = _titles.FirstOrDefault(t => String.Compare(t.Name, titleName, StringComparison.OrdinalIgnoreCase) == 0);
    if (title == null)
    {
        title = new Title { Name = titleName };
        repository.Insert(title);
        _titles.Add(title);
    }
    return title.Id;
}
All repository.Insert() calls look like this, except the entity types differ:
public void Insert(Employee entity)
{
    CurrentDbContext.Employees.Add(entity);
    CurrentDbContext.SaveChanges();
}
And all PKs are Guids. I know this could be a small problem, but I didn't expect it to have such a large effect with small volumes like this.
I have done no tuning or optimization on this routine yet, as it was only for my small test db, but yesterday I was forced to do a surprise import of 6000 records. Toward the end, this process had slowed to about 1 sec per record, which is quite dismal. I wouldn't have expected high speeds without some tuning, but nothing as slow as that.
Is there anything obviously, grossly wrong with my method here?
Your GetTitleId is pulling all Titles from the database down into the application once, but then it does a linear search over all of them for each "Candidate". That is likely to be very expensive. Use a client-side hashtable (dictionary) keyed with StringComparer.OrdinalIgnoreCase instead.
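A minimal sketch of that idea, reusing the Title and repository types from the question and assuming title names are unique (the _titleLookup field is new):
private Dictionary<string, Title> _titleLookup;

private Guid GetTitleId(string titleName)
{
    ITitleRepository repository = new TitleRepository();
    if (_titleLookup == null)
    {
        // Build the lookup once; case-insensitive keys replace the repeated linear scans.
        _titleLookup = repository.ListAll()
            .ToDictionary(t => t.Name, StringComparer.OrdinalIgnoreCase);
    }

    Title title;
    if (!_titleLookup.TryGetValue(titleName, out title))
    {
        title = new Title { Name = titleName };
        repository.Insert(title);
        _titleLookup.Add(titleName, title);
    }
    return title.Id;
}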
Also, why don't you profile your application? Put load on it and hit Break in the debugger 10 times. Where does it stop most of the time? That is the hot spot.
Thanks to the comment from marc_s above, I changed the routine to only call SaveChanges every 500 inserts, and the speed improved by about 500%.
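For illustration, a sketch of what that batching could look like when applied to the import loop above; the pending counter is new, and it assumes the repository exposes some SaveChanges (or similar flush) method and that Insert no longer calls SaveChanges itself:
var cts2005 = new Cts2005Entities();
IEmployeeRepository repository = new EmployeeRepository();

int pending = 0;
foreach (var c in cts2005.Candidates)
{
    var e = new Employee();
    e.RefNum = c.CA_EMP_ID;
    e.TitleId = GetTitleId(c.TITLE);
    e.Initials = c.CA_INITIALS;
    e.Surname = c.CA_SURNAME;
    repository.Insert(e);         // assume Insert now only does Add, not SaveChanges

    if (++pending % 500 == 0)
    {
        repository.SaveChanges(); // hypothetical: flush once per 500 inserts
    }
}
repository.SaveChanges();         // flush whatever is left over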

When is LINQ (to objects) Overused?

My career started as a hard-core functional-paradigm developer (LISP), and now I'm a hard-core .net/C# developer. Of course I'm enamored with LINQ. However, I also believe in (1) using the right tool for the job and (2) preserving the KISS principle: of the 60+ engineers I work with, perhaps only 20% have hours of LINQ / functional paradigm experience, and 5% have 6 to 12 months of such experience. In short, I feel compelled to stay away from LINQ unless I'm hampered in achieving a goal without it (wherein replacing 3 lines of O-O code with one line of LINQ is not a "goal").
But now one of the engineers, having 12 months LINQ / functional-paradigm experience, is using LINQ to objects, or at least lambda expressions anyway, in every conceivable location in production code. My various appeals to the KISS principle have not yielded any results. Therefore...
What published studies can I next appeal to? What "coding standard" guideline have others concocted with some success? Are there published LINQ performance issues I could point out? In short, I'm trying to achieve my first goal - KISS - by indirect persuasion.
Of course this problem could be extended to countless other areas (such as overuse of extension methods). Perhaps there is an "uber" guide, highly regarded (e.g. published studies, etc), that takes a broader swing at this. Anything?
LATE EDIT: Wow! I got schooled! I agree I'm coming at this entirely wrong-headed. But as a clarification, please take a look below at sample code I'm actually seeing. Originally it compiled and worked, but its purpose is now irrelevant. Just go with the "feel" of it. Now that I'm revisiting this sample a half year later, I'm getting a very different picture of what is actually bothering me. But I'd like to have better eyes than mine make the comments.
// This looks like it was meant to become an extension method...
public class ExtensionOfThreadPool
{
    public static bool QueueUserWorkItem(Action callback)
    {
        return ThreadPool.QueueUserWorkItem((o) => callback());
    }
}

public class LoadBalancer
{
    // other methods and state variables have been stripped...
    void ThreadWorker()
    {
        // The following callbacks give us an easy way to control whether
        // we add additional headers around outbound WCF calls.
        Action<Action> WorkRunner = null;

        // This callback adds headers to each WCF call it scopes
        Action<Action> WorkRunnerAddHeaders = (Action action) =>
        {
            // Add the header to all outbound requests.
            HttpRequestMessageProperty httpRequestMessage = new HttpRequestMessageProperty();
            httpRequestMessage.Headers.Add("user-agent", "Endpoint Service");
            // Open an operation scope - any WCF calls in this scope will add the
            // headers above.
            using (OperationContextScope scope = new OperationContextScope(_edsProxy.InnerChannel))
            {
                // Seed the agent id header
                OperationContext.Current.OutgoingMessageProperties[HttpRequestMessageProperty.Name] = httpRequestMessage;
                // Activate
                action();
            }
        };

        // This callback does not add any headers to each WCF call
        Action<Action> WorkRunnerNoHeaders = (Action action) =>
        {
            action();
        };

        // Assign the work runner we want based on the userWCFHeaders
        // flag.
        WorkRunner = _userWCFHeaders ? WorkRunnerAddHeaders : WorkRunnerNoHeaders;

        // This outer try/catch exists simply to dispose of the client connection
        try
        {
            Action Exercise = () =>
            {
                // This worker thread polls a work list
                Action Driver = null;
                Driver = () =>
                {
                    LoadRunnerModel currentModel = null;
                    try
                    {
                        // random starting value, it matters little
                        int minSleepPeriod = 10;
                        int sleepPeriod = minSleepPeriod;
                        // Loop infinitely or until stop signals
                        while (!_workerStopSig)
                        {
                            // Sleep the minimum period of time to service the next element
                            Thread.Sleep(sleepPeriod);
                            // Grab a safe copy of the element list
                            LoadRunnerModel[] elements = null;
                            _pointModelsLock.Read(() => elements = _endpoints);
                            DateTime now = DateTime.Now;
                            var pointsReadyToSend = elements.Where
                            (
                                point => point.InterlockedRead(() => point.Live && (point.GoLive <= now))
                            ).ToArray();
                            // Get a list of all the points that are not ready to send
                            var pointsNotReadyToSend = elements.Except(pointsReadyToSend).ToArray();
                            // Walk each model - we touch each one inside a lock
                            // since there can be other threads operating on the model
                            // including timeouts and returning WCF calls.
                            pointsReadyToSend.ForEach
                            (
                                model =>
                                {
                                    model.Write
                                    (
                                        () =>
                                        {
                                            // Keep a record of the current model in case
                                            // it throws an exception while we're staging it
                                            currentModel = model;
                                            // Lower the live flag (if we crash calling
                                            // BeginXXX the catch code will re-start us)
                                            model.Live = false;
                                            // Get the step for this model
                                            ScenarioStep step = model.Scenario.Steps.Current;
                                            // This helper enables the scenario watchdog if a
                                            // scenario is just starting
                                            Action StartScenario = () =>
                                            {
                                                if (step.IsFirstStep && !model.Scenario.EnableWatchdog)
                                                {
                                                    model.ScenarioStarted = now;
                                                    model.Scenario.EnableWatchdog = true;
                                                }
                                            };
                                            // make a connection (if needed)
                                            if (step.UseHook && !model.HookAttached)
                                            {
                                                BeginReceiveEventWindow(model, step.HookMode == ScenarioStep.HookType.Polled);
                                                step.RecordHistory("LoadRunner: Staged Harpoon");
                                                StartScenario();
                                            }
                                            // Send/Receive (if needed)
                                            if (step.ReadyToSend)
                                            {
                                                BeginSendLoop(model);
                                                step.RecordHistory("LoadRunner: Staged SendLoop");
                                                StartScenario();
                                            }
                                        }
                                    );
                                }
                                , () => _workerStopSig
                            );
                            // Sleep until the next point goes active. Figure out
                            // the shortest sleep period we have - that's how long
                            // we'll sleep.
                            if (pointsNotReadyToSend.Count() > 0)
                            {
                                var smallest = pointsNotReadyToSend.Min(ping => ping.GoLive);
                                sleepPeriod = (smallest > now) ? (int)(smallest - now).TotalMilliseconds : minSleepPeriod;
                                sleepPeriod = sleepPeriod < 0 ? minSleepPeriod : sleepPeriod;
                            }
                            else
                                sleepPeriod = minSleepPeriod;
                        }
                    }
                    catch (Exception eWorker)
                    {
                        // Don't recover if we're shutting down anyway
                        if (_workerStopSig)
                            return;
                        Action RebootDriver = () =>
                        {
                            // Reset the point SendLoop that barfed
                            Stagepoint(true, currentModel);
                            // Re-boot this thread
                            ExtensionOfThreadPool.QueueUserWorkItem(Driver);
                        };
                        // This means SendLoop barfed
                        if (eWorker is BeginSendLoopException)
                        {
                            Interlocked.Increment(ref _beginHookErrors);
                            currentModel.Write(() => currentModel.HookAttached = false);
                            RebootDriver();
                        }
                        // This means BeginSendAndReceive barfed
                        else if (eWorker is BeginSendLoopException)
                        {
                            Interlocked.Increment(ref _beginSendLoopErrors);
                            RebootDriver();
                        }
                        // The only kind of exceptions we expect are the
                        // BeginXXX type. If we made it here something else bad
                        // happened so allow the worker to die completely.
                        else
                            throw;
                    }
                };
                // Start the driver thread. This thread will poll the point list
                // and keep shoveling them out
                ExtensionOfThreadPool.QueueUserWorkItem(Driver);
                // Wait for the stop signal
                _workerStop.WaitOne();
            };
            // Start
            WorkRunner(Exercise);
        }
        catch (Exception ex) { /* not shown */ }
    }
}
Well, it sounds to me like you're the one wanting to make the code more complicated - because you believe your colleagues aren't up to the genuinely simple approach. In many, many cases I find LINQ to Objects makes the code simpler - and yes that does include changing just a few lines to one:
int count = 0;
foreach (Foo f in GenerateFoos())
{
    count++;
}
becoming
int count = GenerateFoos().Count();
for example.
Where it isn't making the code simpler, it's fine to try to steer him away from LINQ - but the above is an example where you certainly aren't significantly hampered by avoiding LINQ, yet the "KISS" code is clearly the LINQ code.
It sounds like your company could benefit from training up its engineers to take advantage of LINQ to Objects, rather than trying to always appeal to the lowest common denominator.
You seem to be equating Linq to objects with greater complexity, because you assume that unnecessary use of it violates "keep it simple, stupid".
All my experience has been the opposite: it makes complex algorithms much simpler to write and read.
On the contrary, I now regard imperative, statement-based, state-mutational programming as the "risky" option to be used only when really necessary.
So I'd suggest that you put effort into getting more of your colleagues to understand the benefit. It's a false economy to try to limit your approaches to those that you (and others) already understand, because in this industry it pays huge dividends to stay in touch with "new" practices (of course, this stuff is hardly new, but as you point out, it's new to many from a Java or C# 1.x background).
As for trying to pin some charge of "performance issues" on it, I don't think you're going to have much luck. The overhead involved in Linq-to-objects itself is minuscule.
