Why is the second LINQ query faster? - performance

In this code:
static bool Spin(int WaitTime)
{
Console.WriteLine("Running task {0} : thread {1}",
Task.CurrentId, Thread.CurrentThread.ManagedThreadId);
Thread.Sleep(WaitTime);
return true;
}
public void DemoPLINQLong()
{
var SomeBigNumber = 1000000;
var sequence = Enumerable.Range(0, SomeBigNumber);
var sw = new Stopwatch();
sw.Start();
sequence.Where(i => Spin(SomeBigNumber));
sw.Stop();
var synchTime = sw.Elapsed;
sw.Restart();
sequence.Where(i => Spin(SomeBigNumber));
sw.Stop();
var asynchTime = sw.Elapsed;
Console.WriteLine("Synchronous: {0} Asynchronous: {1}",
synchTime.ToString(), asynchTime.ToString());
}
The results are consistent:
Synchronous: 00:00:00.0021800 Asynchronous: 00:00:00.0000076
Why is the second LINQ query hundreds of times faster? Is there some kind of caching going on? How?

.NET compiles code to native machine code the first time it is executed; this is known as Just-In-Time (JIT) compilation. On subsequent calls to the same code, the runtime reuses the already-compiled code, which is why you'll frequently see the first run of nearly anything being much slower than later runs of the same code.
A couple of side notes about the posted code:
I'm not sure what the "Synchronous" and "Asynchronous" terms refer to; both queries are exactly the same, and there is nothing asynchronous about either of them.
If you're not aware, none of the LINQ in the example is actually evaluated, because LINQ uses deferred execution. You can see this by changing sequence.Where(i => Spin(SomeBigNumber)) to sequence.Where(i => Spin(SomeBigNumber)).ToList(). Here, ToList() forces evaluation of the LINQ predicate, and the Console.WriteLine in the Spin method will write to the console.
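The deferred behavior described above can be checked with a small counter. This is a minimal sketch, not the asker's code: the Check method and the calls field are hypothetical names, and a cheap predicate replaces Spin so the example finishes immediately.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DeferredDemo
{
    static int calls; // how many times the predicate has run (hypothetical helper state)

    static bool Check(int i)
    {
        calls++;
        return i % 2 == 0;
    }

    static void Main()
    {
        IEnumerable<int> sequence = Enumerable.Range(0, 10);

        // Building the query does not run the predicate.
        IEnumerable<int> query = sequence.Where(i => Check(i));
        Console.WriteLine("After Where: {0} calls", calls);   // After Where: 0 calls

        // ToList() enumerates the query, so the predicate runs once per element.
        List<int> evens = query.ToList();
        Console.WriteLine("After ToList: {0} calls, {1} matches", calls, evens.Count);
        // After ToList: 10 calls, 5 matches
    }
}
```

In the question's code neither Where call is ever enumerated, so both timings measure only the cost of constructing the query object.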

Related

How to retrieve element from IQueryable in Parallel Loop

I get my records here.
IQueryable<EmployeeItem> dtEmployee = GetAll();
After that, I loop over dtEmployee.
This is my normal loop which is working fine.
for (int i = 0; i < dtEmployee.Count(); i++)
{
var drEmployee = dtEmployee.AsEnumerable().ElementAt(i);
}
This is the parallel loop that I want to try.
System.Threading.Tasks.Parallel.For(0, dtEmployee.Count(), i => {
var drEmployee = dtEmployee.AsEnumerable().ElementAt(i);
});
I don't have any compile errors but when I run it in my Visual Studio, I got this error:
This is my normal loop which is working fine.
I wouldn't say fine; it's incredibly inefficient. dtEmployee represents a query, and AsEnumerable() turns it into a lazy sequence rather than a database query, so each ElementAt() call (and each Count() call) causes a separate database query, which likely retrieves all the employees.
To fix this, use foreach:
foreach (var drEmployee in dtEmployee)
{
}
This is the parallel loop that I want to try.
Since dtEmployee is not thread-safe, you can't access it from multiple threads at the same time. The fix here is actually pretty much the same as above: use Parallel.ForEach():
Parallel.ForEach(dtEmployee, drEmployee => {
});
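One variation on the approach above is to run the query once, keep the results in memory, and only then fan out across threads. The sketch below is an illustration under assumptions: EmployeeItem here is a hypothetical stand-in class, the in-memory AsQueryable() source replaces GetAll(), and the Id sum is only there to give the loop body something thread-safe to do.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class EmployeeItem // hypothetical stand-in for the question's entity
{
    public int Id { get; set; }
}

static class ParallelDemo
{
    static void Main()
    {
        // Stand-in for GetAll(); a real IQueryable would hit the database.
        IQueryable<EmployeeItem> dtEmployee = Enumerable.Range(1, 100)
            .Select(i => new EmployeeItem { Id = i })
            .AsQueryable();

        // Run the query once and keep the results in memory.
        List<EmployeeItem> employees = dtEmployee.ToList();

        long idSum = 0;
        Parallel.ForEach(employees, drEmployee =>
        {
            // Shared state must be updated in a thread-safe way.
            Interlocked.Add(ref idSum, drEmployee.Id);
        });

        Console.WriteLine(idSum); // 5050
    }
}
```

Materializing first means the worker threads only read from a plain list, so no database context is touched concurrently.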

Why is performance so bad on this EF data import method?

I have a loop, abridged here, that performs an import of employee records, as follows:
var cts2005 = new Cts2005Entities();
IEmployeeRepository repository = new EmployeeRepository();
foreach (var c in cts2005.Candidates)
{
var e = new Employee();
e.RefNum = c.CA_EMP_ID;
e.TitleId = GetTitleId(c.TITLE);
e.Initials = c.CA_INITIALS;
e.Surname = c.CA_SURNAME;
repository.Insert(e);
}
There are actually several more fields, and a total of nine lookups like GetTitleId(c.TITLE)
above. Code for these is all exactly like this:
private List<Title> _titles;
private Guid GetTitleId(string titleName)
{
ITitleRepository repository = new TitleRepository();
if (_titles == null)
{
_titles = repository.ListAll().ToList();
}
var title = _titles.FirstOrDefault(t => String.Compare(t.Name, titleName, StringComparison.OrdinalIgnoreCase) == 0);
if (title == null)
{
title = new Title { Name = titleName };
repository.Insert(title);
_titles.Add(title);
}
return title.Id;
}
All repository.Insert() calls look like this, except the entity types differ:
public void Insert(Employee entity)
{
CurrentDbContext.Employees.Add(entity);
CurrentDbContext.SaveChanges();
}
And all PK's are Guid. I know this could be a small problem, but I didn't expect it to have such a large effect with small volumes like this.
I have done no tuning or optimization on this routine yet, as it was only for my small test db, but yesterday I was forced to do a surprise import of 6000 records. Toward the end, this process had slowed to about 1 sec per record, which is quite dismal. I wouldn't have expected high speeds without some tuning, but nothing as low as that.
Is there anything obviously, grossly wrong with my method here?
Your GetTitleId pulls all Titles from the database into the application once, but it then does a linear search over all of them for each "Candidate". That is likely to be very expensive. Use a client-side hash table (for example, a Dictionary keyed by title name) with StringComparer.OrdinalIgnoreCase.
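A case-insensitive lookup table along these lines might look like the following. It is a sketch only: TitleCache is a hypothetical class, and Guid.NewGuid() stands in for the repository insert that the real code would perform when a title is missing.

```csharp
using System;
using System.Collections.Generic;

class TitleCache // hypothetical helper, not part of the original code
{
    // Case-insensitive keys, so "Mr" and "MR" resolve to the same entry.
    private readonly Dictionary<string, Guid> _titleIds =
        new Dictionary<string, Guid>(StringComparer.OrdinalIgnoreCase);

    public Guid GetTitleId(string titleName)
    {
        // Note: a null titleName would throw ArgumentNullException here.
        Guid id;
        if (!_titleIds.TryGetValue(titleName, out id))
        {
            // Assumption: the real code would insert a new Title through
            // the repository at this point and use its generated Id.
            id = Guid.NewGuid();
            _titleIds.Add(titleName, id);
        }
        return id;
    }
}

static class Program
{
    static void Main()
    {
        var cache = new TitleCache();
        Guid first = cache.GetTitleId("Mr");
        Guid second = cache.GetTitleId("MR");
        Console.WriteLine(first == second); // True
    }
}
```

Each lookup is then a hash probe instead of a scan over every title, which matters when nine such lookups run per imported record.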
Also, why don't you profile your application? Put load on it and hit break in the debugger 10 times. Where does it stop most of the time? This is the hot-spot.
Thanks to the comment from marc_s above, I changed the routine to only call SaveChanges every 500 inserts, and the speed improved by about 500%.

What are the benefits of a Deferred Execution in LINQ?

LINQ uses a deferred execution model, which means that the resulting sequence is not returned at the time the LINQ operators are called. Instead, these operators return an object that yields the elements of the sequence only when we enumerate it.
While I understand how deferred queries work, I'm having some trouble understanding the benefits of deferred execution:
1) I've read that having a deferred query execute only when you actually need the results can be of great benefit. So what is this benefit?
2) Another advantage of deferred queries is that if you define a query once, then each time you enumerate the results, you will get different results if the data changes.
a) But as seen from the code below, we're able to achieve the same effect (each time we enumerate the resource, we get a different result if the data changed) even without using deferred queries:
List<string> sList = new List<string>( new[]{ "A","B" });
foreach (string item in sList)
Console.WriteLine(item); // Q1 outputs AB
sList.Add("C");
foreach (string item in sList)
Console.WriteLine(item); // Q2 outputs ABC
3) Are there any other benefits of deferred execution?
The main benefit is that this allows filtering operations, the core of LINQ, to be much more efficient. (This is effectively your item #1).
For example, take a LINQ query like this:
var results = collection.Select(item => item.Foo).Where(foo => foo < 3).ToList();
With deferred execution, the above iterates your collection one time, and each time an item is requested during the iteration, performs the map operation, filters, then uses the results to build the list.
If you were to make LINQ fully execute each time, each operation (Select / Where) would have to iterate through the entire sequence. This would make chained operations very inefficient.
Personally, I'd say your item #2 above is more of a side effect rather than a benefit - while it's, at times, beneficial, it also causes some confusion at times, so I would just consider this "something to understand" and not tout it as a benefit of LINQ.
In response to your edit:
In your particular example, in both cases Select would iterate collection and return an IEnumerable I1 of type item.Foo. Where() would then enumerate I1 and return IEnumerable<> I2 of type item.Foo. I2 would then be converted to List.
This is not true - deferred execution prevents this from occurring.
In my example, the return type is IEnumerable<T>, which means that it's a collection that can be enumerated, but, due to deferred execution, it isn't actually enumerated.
When you call ToList(), the entire collection is enumerated. The result ends up looking conceptually something more like (though, of course, different):
List<Foo> results = new List<Foo>();
foreach(var item in collection)
{
// "Select" does a mapping
var foo = item.Foo;
// "Where" filters
if (!(foo < 3))
continue;
// "ToList" builds results
results.Add(foo);
}
Deferred execution causes the sequence itself to only be enumerated (foreach) one time, when it's used (by ToList()). Without deferred execution, it would look more like (conceptually):
// Select
List<Foo> foos = new List<Foo>();
foreach(var item in collection)
{
foos.Add(item.Foo);
}
// Where
List<Foo> foosFiltered = new List<Foo>();
foreach(var foo in foos)
{
if (foo < 3)
foosFiltered.Add(foo);
}
List<Foo> results = new List<Foo>();
foreach(var item in foosFiltered)
{
results.Add(item);
}
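The single-pass behavior can also be observed directly by logging from inside the lambdas. This is a small illustrative sketch; the log list and the sample array are assumptions made for the demonstration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class PipelineDemo
{
    static void Main()
    {
        var log = new List<string>();
        int[] collection = { 1, 5, 2 };

        // Each lambda records when it runs, so the log shows the evaluation order.
        List<int> results = collection
            .Select(item => { log.Add("select " + item); return item; })
            .Where(foo => { log.Add("where " + foo); return foo < 3; })
            .ToList();

        Console.WriteLine(string.Join(", ", log));
        // select 1, where 1, select 5, where 5, select 2, where 2
        Console.WriteLine(string.Join(", ", results));
        // 1, 2
    }
}
```

The log alternates between the two operators: each element passes through Select and Where before the next element is read, rather than Select finishing over the whole array first.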
Another benefit of deferred execution is that it allows you to work with infinite series. For instance:
public static IEnumerable<ulong> FibonacciNumbers()
{
yield return 0;
yield return 1;
ulong previous = 0, current = 1;
while (true)
{
ulong next = checked(previous + current);
yield return next;
previous = current;
current = next;
}
}
(Source: http://chrisfulstow.com/fibonacci-numbers-iterator-with-csharp-yield-statements/)
You can then do the following:
var firstTenOddFibNumbers = FibonacciNumbers().Where(n=>n%2 == 1).Take(10);
foreach (var num in firstTenOddFibNumbers)
{
Console.WriteLine(num);
}
Prints:
1
1
3
5
13
21
55
89
233
377
Without deferred execution, the query would try to generate the whole sequence up front: you would get an OverflowException, or, if the addition weren't checked, it would run forever because the value wraps around (and calling ToList on it would eventually cause an OutOfMemoryException).
An important benefit of deferred execution is that you receive up-to-date data. This may be a hit on performance (especially if you are dealing with absurdly large data sets) but equally the data might have changed by the time your original query returns a result. Deferred execution makes sure you will get the latest information from the database in scenarios where the database is updated rapidly.
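The same effect can be seen with an in-memory source. The sketch below uses a List<int> as a stand-in for the database (an assumption made so the example is self-contained) and contrasts a deferred query with a snapshot taken by ToList().

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class FreshDataDemo
{
    static void Main()
    {
        var numbers = new List<int> { 1, 2, 3 };

        // Deferred: the filter runs each time the query is enumerated.
        IEnumerable<int> query = numbers.Where(n => n > 1);
        Console.WriteLine(query.Count()); // 2

        numbers.Add(4);
        Console.WriteLine(query.Count()); // 3, the new element is seen

        // ToList() takes a snapshot; later changes do not affect it.
        List<int> snapshot = query.ToList();
        numbers.Add(5);
        Console.WriteLine(snapshot.Count); // 3
    }
}
```

The trade-off is that every enumeration repeats the work, which is why results are often materialized once when fresh data is not needed.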

When is LINQ (to objects) Overused?

My career started as a hard-core functional-paradigm developer (LISP), and now I'm a hard-core .net/C# developer. Of course I'm enamored with LINQ. However, I also believe in (1) using the right tool for the job and (2) preserving the KISS principle: of the 60+ engineers I work with, perhaps only 20% have hours of LINQ / functional paradigm experience, and 5% have 6 to 12 months of such experience. In short, I feel compelled to stay away from LINQ unless I'm hampered in achieving a goal without it (wherein replacing 3 lines of O-O code with one line of LINQ is not a "goal").
But now one of the engineers, having 12 months LINQ / functional-paradigm experience, is using LINQ to objects, or at least lambda expressions anyway, in every conceivable location in production code. My various appeals to the KISS principle have not yielded any results. Therefore...
What published studies can I next appeal to? What "coding standard" guideline have others concocted with some success? Are there published LINQ performance issues I could point out? In short, I'm trying to achieve my first goal - KISS - by indirect persuasion.
Of course this problem could be extended to countless other areas (such as overuse of extension methods). Perhaps there is an "uber" guide, highly regarded (e.g. published studies, etc), that takes a broader swing at this. Anything?
LATE EDIT: Wow! I got schooled! I agree I'm coming at this entirely wrong-headed. But as a clarification, please take a look below at sample code I'm actually seeing. Originally it compiled and worked, but its purpose is now irrelevant. Just go with the "feel" of it. Now that I'm revisiting this sample a half year later, I'm getting a very different picture of what is actually bothering me. But I'd like to have better eyes than mine make the comments.
//This looks like it was meant to become an extension method...
public class ExtensionOfThreadPool
{
public static bool QueueUserWorkItem(Action callback)
{
return ThreadPool.QueueUserWorkItem((o) => callback());
}
}
public class LoadBalancer
{
//other methods and state variables have been stripped...
void ThreadWorker()
{
// The following callbacks give us an easy way to control whether
// we add additional headers around outbound WCF calls.
Action<Action> WorkRunner = null;
// This callback adds headers to each WCF call it scopes
Action<Action> WorkRunnerAddHeaders = (Action action) =>
{
// Add the header to all outbound requests.
HttpRequestMessageProperty httpRequestMessage = new HttpRequestMessageProperty();
httpRequestMessage.Headers.Add("user-agent", "Endpoint Service");
// Open an operation scope - any WCF calls in this scope will add the
// headers above.
using (OperationContextScope scope = new OperationContextScope(_edsProxy.InnerChannel))
{
// Seed the agent id header
OperationContext.Current.OutgoingMessageProperties[HttpRequestMessageProperty.Name] = httpRequestMessage;
// Activate
action();
}
};
// This callback does not add any headers to each WCF call
Action<Action> WorkRunnerNoHeaders = (Action action) =>
{
action();
};
// Assign the work runner we want based on the userWCFHeaders
// flag.
WorkRunner = _userWCFHeaders ? WorkRunnerAddHeaders : WorkRunnerNoHeaders;
// This outer try/catch exists simply to dispose of the client connection
try
{
Action Exercise = () =>
{
// This worker thread polls a work list
Action Driver = null;
Driver = () =>
{
LoadRunnerModel currentModel = null;
try
{
// random starting value, it matters little
int minSleepPeriod = 10;
int sleepPeriod = minSleepPeriod;
// Loop infinitely or until stop signals
while (!_workerStopSig)
{
// Sleep the minimum period of time to service the next element
Thread.Sleep(sleepPeriod);
// Grab a safe copy of the element list
LoadRunnerModel[] elements = null;
_pointModelsLock.Read(() => elements = _endpoints);
DateTime now = DateTime.Now;
var pointsReadyToSend = elements.Where
(
point => point.InterlockedRead(() => point.Live && (point.GoLive <= now))
).ToArray();
// Get a list of all the points that are not ready to send
var pointsNotReadyToSend = elements.Except(pointsReadyToSend).ToArray();
// Walk each model - we touch each one inside a lock
// since there can be other threads operating on the model
// including timeouts and returning WCF calls.
pointsReadyToSend.ForEach
(
model =>
{
model.Write
(
() =>
{
// Keep a record of the current model in case
// it throws an exception while we're staging it
currentModel = model;
// Lower the live flag (if we crash calling
// BeginXXX the catch code will re-start us)
model.Live = false;
// Get the step for this model
ScenarioStep step = model.Scenario.Steps.Current;
// This helper enables the scenario watchdog if a
// scenario is just starting
Action StartScenario = () =>
{
if (step.IsFirstStep && !model.Scenario.EnableWatchdog)
{
model.ScenarioStarted = now;
model.Scenario.EnableWatchdog = true;
}
};
// make a connection (if needed)
if (step.UseHook && !model.HookAttached)
{
BeginReceiveEventWindow(model, step.HookMode == ScenarioStep.HookType.Polled);
step.RecordHistory("LoadRunner: Staged Harpoon");
StartScenario();
}
// Send/Receive (if needed)
if (step.ReadyToSend)
{
BeginSendLoop(model);
step.RecordHistory("LoadRunner: Staged SendLoop");
StartScenario();
}
}
);
}
, () => _workerStopSig
);
// Sleep until the next point goes active. Figure out
// the shortest sleep period we have - that's how long
// we'll sleep.
if (pointsNotReadyToSend.Count() > 0)
{
var smallest = pointsNotReadyToSend.Min(ping => ping.GoLive);
sleepPeriod = (smallest > now) ? (int)(smallest - now).TotalMilliseconds : minSleepPeriod;
sleepPeriod = sleepPeriod < 0 ? minSleepPeriod : sleepPeriod;
}
else
sleepPeriod = minSleepPeriod;
}
}
catch (Exception eWorker)
{
// Don't recover if we're shutting down anyway
if (_workerStopSig)
return;
Action RebootDriver = () =>
{
// Reset the point SendLoop that barfed
Stagepoint(true, currentModel);
// Re-boot this thread
ExtensionOfThreadPool.QueueUserWorkItem(Driver);
};
// This means SendLoop barfed
if (eWorker is BeginSendLoopException)
{
Interlocked.Increment(ref _beginHookErrors);
currentModel.Write(() => currentModel.HookAttached = false);
RebootDriver();
}
// This means BeginSendAndReceive barfed
else if (eWorker is BeginSendLoopException)
{
Interlocked.Increment(ref _beginSendLoopErrors);
RebootDriver();
}
// The only kind of exceptions we expect are the
// BeginXXX type. If we made it here something else bad
// happened so allow the worker to die completely.
else
throw;
}
};
// Start the driver thread. This thread will poll the point list
// and keep shoveling them out
ExtensionOfThreadPool.QueueUserWorkItem(Driver);
// Wait for the stop signal
_workerStop.WaitOne();
};
// Start
WorkRunner(Exercise);
}
catch (Exception ex) { /* not shown */ }
}
}
Well, it sounds to me like you're the one wanting to make the code more complicated - because you believe your colleagues aren't up to the genuinely simple approach. In many, many cases I find LINQ to Objects makes the code simpler - and yes that does include changing just a few lines to one:
int count = 0;
foreach (Foo f in GenerateFoos())
{
count++;
}
becoming
int count = GenerateFoos().Count();
for example.
Where it isn't making the code simpler, it's fine to try to steer him away from LINQ - but the above is an example where you certainly aren't significantly hampered by avoiding LINQ, but the "KISS" code is clearly the LINQ code.
It sounds like your company could benefit from training up its engineers to take advantage of LINQ to Objects, rather than trying to always appeal to the lowest common denominator.
You seem to be equating Linq to objects with greater complexity, because you assume that unnecessary use of it violates "keep it simple, stupid".
All my experience has been the opposite: it makes complex algorithms much simpler to write and read.
On the contrary, I now regard imperative, statement-based, state-mutational programming as the "risky" option to be used only when really necessary.
So I'd suggest that you put effort into getting more of your colleagues to understand the benefit. It's a false economy to try to limit your approaches to those that you (and others) already understand, because in this industry it pays huge dividends to stay in touch with "new" practises (of course, this stuff is hardly new, but as you point out, it's new to many from a Java or C# 1.x background).
As for trying to pin some charge of "performance issues" on it, I don't think you're going to have much luck. The overhead involved in Linq-to-objects itself is minuscule.

Linq Efficiency question - foreach vs aggregates

Which is more efficient?
//Option 1
foreach (var q in baseQuery)
{
m_TotalCashDeposit += q.deposit.Cash;
m_TotalCheckDeposit += q.deposit.Check;
m_TotalCashWithdrawal += q.withdraw.Cash;
m_TotalCheckWithdrawal += q.withdraw.Check;
}
//Option 2
m_TotalCashDeposit = baseQuery.Sum(q => q.deposit.Cash);
m_TotalCheckDeposit = baseQuery.Sum(q => q.deposit.Check);
m_TotalCashWithdrawal = baseQuery.Sum(q => q.withdraw.Cash);
m_TotalCheckWithdrawal = baseQuery.Sum(q => q.withdraw.Check);
I guess what I'm asking is, calling Sum will basically enumerate over the list right? So if I call Sum four times, isn't that enumerating over the list four times? Wouldn't it be more efficient to just do a foreach instead so I only have to enumerate the list once?
It might, and it might not; it depends.
The only sure way to know is to actually measure it.
To do that, use BenchmarkDotNet. Here's an example that you can run in LINQPad or a console application:
void Main()
{
BenchmarkSwitcher.FromAssembly(GetType().Assembly).RunAll();
}
public class Benchmarks
{
[Benchmark]
public void Option1()
{
// foreach (var q in baseQuery)
// {
// m_TotalCashDeposit += q.deposit.Cash;
// m_TotalCheckDeposit += q.deposit.Check;
// m_TotalCashWithdrawal += q.withdraw.Cash;
// m_TotalCheckWithdrawal += q.withdraw.Check;
// }
}
[Benchmark]
public void Option2()
{
// m_TotalCashDeposit = baseQuery.Sum(q => q.deposit.Cash);
// m_TotalCheckDeposit = baseQuery.Sum(q => q.deposit.Check);
// m_TotalCashWithdrawal = baseQuery.Sum(q => q.withdraw.Cash);
// m_TotalCheckWithdrawal = baseQuery.Sum(q => q.withdraw.Check);
}
}
BenchmarkDotNet is a powerful library for measuring performance, and is much more accurate than simply using Stopwatch, as it will use statistically correct approaches and methods, and also take such things as JITting and GC into account.
Now that I'm older and wiser, I no longer believe using Stopwatch is a good way to measure performance. I won't remove the old answer, as Google and similar links may lead people here looking for how to use Stopwatch to measure performance, but I hope I have added a better approach above.
Original answer below
Simple code to measure it:
Stopwatch sw = new Stopwatch();
sw.Start();
// your code here
sw.Stop();
Debug.WriteLine("Time taken: " + sw.ElapsedMilliseconds + " ms");
sw.Reset(); // in case you have more code below that reuses sw
You should run the code multiple times so that JIT compilation doesn't have too large an effect on your timings.
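A rough way to apply that advice is to make one untimed warm-up call and then time several runs. This is a sketch under assumptions: Work is a hypothetical placeholder workload, and five runs is an arbitrary choice.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

static class TimingDemo
{
    static long Work()
    {
        // Placeholder workload (hypothetical); substitute the code under test.
        return Enumerable.Range(0, 100000).Select(i => (long)i).Sum();
    }

    static void Main()
    {
        Work(); // warm-up call so JIT compilation is not included in the timings

        const int runs = 5;
        var sw = new Stopwatch();
        for (int run = 1; run <= runs; run++)
        {
            sw.Restart(); // resets the elapsed time and starts measuring
            long result = Work();
            sw.Stop();
            Console.WriteLine("Run {0}: {1} ticks (result {2})", run, sw.ElapsedTicks, result);
        }
    }
}
```

Printing each run separately makes it easier to spot outliers than a single averaged figure would.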
I went ahead and profiled this and found that you are correct.
Each Sum() effectively creates its own loop. In my simulation, I had it sum a SQL dataset with 20319 records, each with 3 summable fields, and found that creating your own loop had a 2X advantage.
I had hoped that LINQ would optimize this away and push the whole burden on the SQL server, but unless I move the sum request into the initial LINQ statement, it executes each request one at a time.
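For an in-memory sequence, one way to keep a LINQ style while enumerating only once is Aggregate with an accumulator object. The sketch below is illustrative: Amounts, Entry, and Totals are hypothetical types shaped after the question's property names, and a database-backed query provider may not be able to translate this form.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types shaped after the question's q.deposit.Cash etc.
class Amounts { public decimal Cash; public decimal Check; }
class Entry
{
    public Amounts deposit = new Amounts();
    public Amounts withdraw = new Amounts();
}
class Totals
{
    public decimal CashDeposit, CheckDeposit, CashWithdrawal, CheckWithdrawal;
}

static class SinglePassDemo
{
    static void Main()
    {
        var baseQuery = new List<Entry>
        {
            new Entry { deposit = { Cash = 10m, Check = 1m }, withdraw = { Cash = 2m, Check = 3m } },
            new Entry { deposit = { Cash = 20m, Check = 4m }, withdraw = { Cash = 5m, Check = 6m } },
        };

        // One enumeration updates all four running totals.
        Totals totals = baseQuery.Aggregate(new Totals(), (acc, q) =>
        {
            acc.CashDeposit += q.deposit.Cash;
            acc.CheckDeposit += q.deposit.Check;
            acc.CashWithdrawal += q.withdraw.Cash;
            acc.CheckWithdrawal += q.withdraw.Check;
            return acc;
        });

        Console.WriteLine("{0} {1} {2} {3}",
            totals.CashDeposit, totals.CheckDeposit,
            totals.CashWithdrawal, totals.CheckWithdrawal); // 30 5 7 9
    }
}
```

Whether this beats the plain foreach is something to measure; it is functionally the same single loop expressed as one expression.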

Resources