Can I use LINQ to retrieve only "on change" values?

What I'd like to be able to do is construct a LINQ query that retrieves a few values from some DataRows whenever one of the fields changes. Here's a contrived example to illustrate:
Observation    Temp  Time
-------------  ----  ------
Cloudy         15.0  3:00PM
Cloudy         16.5  4:00PM
Sunny          19.0  3:30PM
Sunny          19.5  3:15PM
Sunny          18.5  3:30PM
Partly Cloudy  16.5  3:20PM
Partly Cloudy  16.0  3:25PM
Cloudy         16.0  4:00PM
Sunny          17.5  3:45PM
I'd like to retrieve only the entries when the Observation changed from the previous one. So the results would include:
Cloudy         15.0  3:00PM
Sunny          19.0  3:30PM
Partly Cloudy  16.5  3:20PM
Cloudy         16.0  4:00PM
Sunny          17.5  3:45PM
Currently there is code that iterates through the DataRows, does the comparisons, and constructs the results, but I was hoping to use LINQ to accomplish this.
What I'd like to do is something like this:
var weatherStuff = from row in ds.Tables[0].AsEnumerable()
                   where row.Field<string>("Observation") != weatherStuff.ElementAt(weatherStuff.Count() - 1)
                   select row;
But that doesn't work - and doesn't compile, since it tries to use the variable 'weatherStuff' before it is declared.
Can what I want to do be done with LINQ? I didn't see another question like it here on SO, but could have missed it.

Here is one more general thought that may be interesting. It's more complicated than what @tvanfosson posted, but in a way I think it's more elegant :-). The operation you want is to group your observations by the first field, but to start a new group each time the value changes, and then select the first element of each group.
This sounds almost like LINQ's group by, but it is a bit different, so you can't really use the standard group by. However, you can write your own version (that's the wonder of LINQ!). You can either write your own extension method (e.g. GroupByMoving), or you can write an extension method that changes the type from IEnumerable to an interface of your own and then define GroupBy for that interface. The resulting query will look like this:
var weatherStuff =
    from row in ds.Tables[0].AsEnumerable().AsMoving()
    group row by row.Field<string>("Observation") into g
    select g.First();
The only thing that remains is to define AsMoving and implement GroupBy. This is a bit of work, but it is a generally useful thing that can be used to solve other problems too, so it may be worth doing :-). The summary of my post is that the great thing about LINQ is that you can customize how the operators behave to get quite elegant code.
I haven't tested it, but the implementation should look like this:
// Interface & simple implementation so that we can change GroupBy
interface IMoving<T> : IEnumerable<T> { }

class WrappedMoving<T> : IMoving<T> {
    public IEnumerable<T> Wrapped { get; set; }

    public IEnumerator<T> GetEnumerator() {
        return Wrapped.GetEnumerator();
    }

    // Explicit non-generic implementation required by IEnumerable
    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() {
        return Wrapped.GetEnumerator();
    }
}
// Important bits:
static class MovingExtensions {
    public static IMoving<T> AsMoving<T>(this IEnumerable<T> e) {
        return new WrappedMoving<T> { Wrapped = e };
    }

    // This is (an ugly & imperative) implementation of the
    // group by as described earlier (you can probably implement it
    // more nicely using other LINQ methods)
    public static IEnumerable<IEnumerable<T>> GroupBy<T, K>(this IMoving<T> source,
        Func<T, K> keySelector) {
        List<T> elementsSoFar = new List<T>();
        IEnumerator<T> en = source.GetEnumerator();
        if (en.MoveNext()) {
            K lastKey = keySelector(en.Current);
            do {
                K newKey = keySelector(en.Current);
                // EqualityComparer, because != can't be used on an
                // unconstrained type parameter
                if (!EqualityComparer<K>.Default.Equals(newKey, lastKey)) {
                    yield return elementsSoFar;
                    elementsSoFar = new List<T>();
                    lastKey = newKey; // remember the key of the new group
                }
                elementsSoFar.Add(en.Current);
            } while (en.MoveNext());
            yield return elementsSoFar;
        }
    }
}

You could use the Where extension method overload that supplies the element's index.
var all = ds.Tables[0].AsEnumerable();
var weatherStuff = all.Where((w, i) => i == 0 ||
    w.Field<string>("Observation") != all.ElementAt(i - 1).Field<string>("Observation"));

This is one of those instances where the iterative solution is actually better than the set-based solution in terms of both readability and performance. All you really want Linq to do is filter and pre-sort the list if necessary to prepare it for the loop.
It is possible to write a query in SQL Server (or various other databases) using windowing functions (ROW_NUMBER), if that's where your data is coming from, but very difficult to do in pure Linq without making a much bigger mess.
If you're just trying to clean the code up, an extension method might help:
public static IEnumerable<T> Changed<T>(this IEnumerable<T> items,
    Func<T, T, bool> equalityFunc)
{
    if (equalityFunc == null)
    {
        throw new ArgumentNullException("equalityFunc");
    }
    T last = default(T);
    bool first = true;
    foreach (T current in items)
    {
        if (first || !equalityFunc(current, last))
        {
            yield return current;
        }
        last = current;
        first = false;
    }
}
Then you can call this with:
var changed = rows.Changed((r1, r2) =>
r1.Field<string>("Observation") == r2.Field<string>("Observation"));

I think what you are trying to accomplish is not possible using the query-expression syntax ("syntactic sugar"). However, it is possible using the Select extension method overload that passes in the index of the item being evaluated, so you can use the index to compare the current item with the previous one (index - 1).
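For instance, a rough sketch of that approach (materializing to a list first so that indexing the previous row is cheap; the names are illustrative):
var rows = ds.Tables[0].AsEnumerable().ToList();
var changed = rows
    .Select((row, i) => new { Row = row, Index = i })
    .Where(x => x.Index == 0 ||
                x.Row.Field<string>("Observation") !=
                rows[x.Index - 1].Field<string>("Observation"))
    .Select(x => x.Row);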

You could use MoreLinq's GroupAdjacent() extension method:
GroupAdjacent: Groups the adjacent elements of a sequence according to
a specified key selector function...This method has 4 overloads.
You would use it like this, with the result-selector overload to lose the IGrouping key:
var weatherStuff = ds.Tables[0].AsEnumerable()
    .GroupAdjacent(w => w.Field<string>("Observation"), (_, val) => val.Select(v => v));
This is a very popular extension to the default LINQ methods, with more than 1M downloads on NuGet (compared to MS's own Ix.NET with ~40k downloads at the time of writing).
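If you only want the first row of each adjacent run (the output the question asks for), you could reduce each group; a sketch using the key-selector-only overload of the same method:
var firstOfEachRun = ds.Tables[0].AsEnumerable()
    .GroupAdjacent(w => w.Field<string>("Observation"))
    .Select(g => g.First());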

Related

Sorting for Azure DocumentDB

I want to use DocumentDB to store roughly 200,000 documents of the same type. Each document gets an integer id field, and I would like to retrieve them paged, in reverse order (highest id first).
So recently I found out there is no sorting for DocumentDB (see also DocumentDB - query result order). Perhaps it would be better to go with a different database (such as RavenDB); however, time is pressing and I want to avoid the cost of switching to another database.
The question:
I have been looking at implementing my own sorted index of the documents on the client side (ASP.NET Web API 2). I was thinking of creating a SortedList with key (id) and value (document.SelfLink). Then I could create a getter with parameters for count, offset, and a predicate to filter the documents. Below I added a quick example.
I just have the feeling this is a bad idea; either slow, costing too many resources or can be better done another way. So I am open for implementation suggestions...
public class SortableDocumentDbRepository
{
    private SortedList _sorted = new SortedList();
    private readonly string _sortedPropertyName;

    private DocumentCollection ReadOrCreateCollection(string databaseLink) {
        DocumentCollection col = base.ReadOrCreateCollection(databaseLink);
        var docs = Client.CreateDocumentQuery(Collection.DocumentsLink)
            .AsEnumerable();
        lock (_sorted.SyncRoot) {
            foreach (Document doc in docs) {
                var propVal = doc.GetPropertyValue<string>(_sortedPropertyName);
                if (propVal != null) {
                    _sorted.Add(propVal, doc.SelfLink);
                }
            }
        }
        return col;
    }

    public List<T> GetItems<T>(int count, int offset, Expression<Func<T, bool>> predicate) {
        List<T> result = new List<T>();
        lock (_sorted.SyncRoot) {
            var values = _sorted.GetValueList();
            for (int i = offset; i < _sorted.Count; i++) {
                var queryable = predicate != null ?
                    Client.CreateDocumentQuery<T>(values[i].ToString()).Where(predicate) :
                    Client.CreateDocumentQuery<T>(values[i].ToString());
                T item = queryable.AsEnumerable().FirstOrDefault();
                if (item == null || item.Equals(default(T))) continue;
                result.Add(item);
                if (result.Count >= count) return result;
            }
        }
        return result;
    }
}
Microsoft has implemented Sorting:
https://learn.microsoft.com/en-us/azure/cosmos-db/sql-api-sql-query-reference#bk_orderby_clause
Example: SELECT * FROM c ORDER BY c._ts DESC
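The same can be expressed through the SDK's LINQ provider; a minimal sketch (MyDocument, its Timestamp property mapped to _ts, and collectionLink are illustrative assumptions):
var newestFirst = client.CreateDocumentQuery<MyDocument>(collectionLink)
    .OrderByDescending(d => d.Timestamp) // translated to ORDER BY ... DESC
    .AsEnumerable()
    .ToList();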
As you mentioned, order by unfortunately isn't implemented yet.
Your approach looks reasonable to me.
I see you are using a predicate to narrow the query result set (pulling 200,000 records from any DB will be costly).
Since it looks like you want to order by id, you can also look into setting up a range index on id, allowing you to perform range queries (e.g. < and >) on the id and further narrow the query result set. There is also a range index included by default on the _ts (timestamp) system property on documents, which may also be helpful in this context.
See: http://azure.microsoft.com/en-us/documentation/articles/documentdb-indexing-policies/

LINQ Out of Memory Error

I am querying 200k records and using up all the server's memory (no surprise). I am new to LINQ, so I found the following code that should help me, but I don't know how to use it:
public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> collection, int batchSize)
{
    List<T> nextbatch = new List<T>(batchSize);
    foreach (T item in collection)
    {
        nextbatch.Add(item);
        if (nextbatch.Count == batchSize)
        {
            yield return nextbatch;
            nextbatch = new List<T>(batchSize);
        }
    }
    if (nextbatch.Count > 0)
        yield return nextbatch;
}
Source: http://goo.gl/aQZIj
Here is my code which creates the "out of memory" error. How do I incorporate the new Batch function into my code?
var crmMetrics = _crmDbContext.tpm_metricsSet.Where(a => a.ModifiedOn >= lastRunDate);
foreach (var crmMetric in crmMetrics)
{
    var metric = new Metric();
    metric.ProductKey = crmMetric.tpm_Product.Id;
    dbContext.Metrics.Add(metric);
    dbContext.SaveChanges();
}
It's an extension method, so if it is part of a static class and there is a reference to the class's namespace in your code, you could do:
var crmMetricsBatches = _crmDbContext.tpm_metricsSet
    .Where(a => a.ModifiedOn >= lastRunDate)
    .AsEnumerable() // !!
    .Batch(20);
Except it wouldn't help. Because of the .AsEnumerable(), you'd still fetch all the data into memory, just in chunks of 20 now. This is because you can't use the method directly against IQueryable: Entity Framework would try to translate it to SQL, but of course has no clue how to do that.
As TGH said, Skip and Take are better suited for this:
var crmMetricsPage = _crmDbContext.tpm_metricsSet
    .Where(a => a.ModifiedOn >= lastRunDate)
    .OrderBy(a => a.??) // some property you choose
    .Skip(pageNo * pageSize)
    .Take(pageSize);
where pageNo counts from 0 to the number of pages (minus 1) you're going to need. Skip and Take are expressions, and EF knows how to convert them to SQL. The OrderBy is required for EF to know where to start skipping.
In this process, called paging, you always get pageSize records at a time. The number of queries is greater, but resources are spared. One condition is that you can determine a pageSize in advance. I don't know if this fits with your logic.
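Put together, a minimal paging loop might look like this (a sketch; ModifiedOn is used as the ordering property and 500 is an arbitrary page size):
const int pageSize = 500;
for (int pageNo = 0; ; pageNo++)
{
    var page = _crmDbContext.tpm_metricsSet
        .Where(a => a.ModifiedOn >= lastRunDate)
        .OrderBy(a => a.ModifiedOn)
        .Skip(pageNo * pageSize)
        .Take(pageSize)
        .ToList();
    if (page.Count == 0)
        break;
    // process this page of records...
}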
If you can't use paging, you should try to narrow the filter (the Where(a => a.ModifiedOn >= lastRunDate)), e.g. try to get the data in batches of one day or week.
I would use Linq's Skip and Take to get the batches
Check this out:
http://www.c-sharpcorner.com/UploadFile/3d39b4/take-and-skip-operator-in-linq-to-sql/

Why is performance so bad on this EF data import method?

I have a loop, abridged here, that performs an import of employee records, as follows:
var cts2005 = new Cts2005Entities();
IEmployeeRepository repository = new EmployeeRepository();
foreach (var c in cts2005.Candidates)
{
    var e = new Employee();
    e.RefNum = c.CA_EMP_ID;
    e.TitleId = GetTitleId(c.TITLE);
    e.Initials = c.CA_INITIALS;
    e.Surname = c.CA_SURNAME;
    repository.Insert(e);
}
There are actually several more fields, and a total of nine lookups like the GetTitleId(c.TITLE) above. The code for each of these looks exactly like this:
private List<Title> _titles;

private Guid GetTitleId(string titleName)
{
    ITitleRepository repository = new TitleRepository();
    if (_titles == null)
    {
        _titles = repository.ListAll().ToList();
    }
    var title = _titles.FirstOrDefault(t => String.Compare(t.Name, titleName, StringComparison.OrdinalIgnoreCase) == 0);
    if (title == null)
    {
        title = new Title { Name = titleName };
        repository.Insert(title);
        _titles.Add(title);
    }
    return title.Id;
}
All repository.Insert() calls look like this, except the entity types differ:
public void Insert(Employee entity)
{
    CurrentDbContext.Employees.Add(entity);
    CurrentDbContext.SaveChanges();
}
And all PKs are Guids. I know this could be a small problem, but I didn't expect it to have such a large effect with small volumes like this.
I have done no tuning or optimization on this routine yet, as it was only for my small test db, but yesterday I was forced to do a surprise import of 6,000 records. Toward the end, this process had slowed to about 1 sec per record, which is quite dismal. I wouldn't have expected high speeds without some tuning, but nothing as slow as that.
Is there anything obviously, grossly wrong with my method here?
Your GetTitleId pulls all Titles from the database down into the application once, but then does a linear search over all of them for every Candidate. That is likely to be very expensive. Use a client-side hash table with StringComparer.OrdinalIgnoreCase.
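A minimal sketch of that change, mirroring the GetTitleId shape from the question (it assumes title names are unique):
private Dictionary<string, Title> _titlesByName;

private Guid GetTitleId(string titleName)
{
    ITitleRepository repository = new TitleRepository();
    if (_titlesByName == null)
    {
        // Build the lookup once; later lookups are O(1) instead of a linear scan
        _titlesByName = repository.ListAll()
            .ToDictionary(t => t.Name, StringComparer.OrdinalIgnoreCase);
    }
    Title title;
    if (!_titlesByName.TryGetValue(titleName, out title))
    {
        title = new Title { Name = titleName };
        repository.Insert(title);
        _titlesByName[titleName] = title;
    }
    return title.Id;
}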
Also, why don't you profile your application? Put load on it and hit break in the debugger 10 times. Where does it stop most of the time? That is your hot spot.
Thanks to the comment from marc_s above, I changed the routine to call SaveChanges only every 500 inserts, and the speed improved by about 500%.
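That change might look roughly like this in the import loop (a sketch; it assumes the SaveChanges call is moved out of the repository's Insert):
int pending = 0;
foreach (var c in cts2005.Candidates)
{
    var e = new Employee();
    e.RefNum = c.CA_EMP_ID;
    // ... other fields and lookups as in the question ...
    CurrentDbContext.Employees.Add(e);
    if (++pending % 500 == 0)
        CurrentDbContext.SaveChanges(); // flush one batch
}
CurrentDbContext.SaveChanges(); // flush the remainder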

linq styling, chaining where clause vs and operator

Is there a (logical/performance) difference between writing:
ATable.Where(x=> condition1 && condition2 && condition3)
or
ATable.Where(x=>condition1).Where(x=>condition2).Where(x=>condition3)
I've been using the former, but realised that with the latter I can more easily read parts of a query and copy them out to use somewhere else.
Any thoughts?
Short answer
You should do what you feel is more readable and maintainable in your application as both will evaluate to the same collection.
Long answer (quite long):
Linq To Objects
ATable.Where(x=> condition1 && condition2 && condition3)
For this example, since there is only one predicate statement, the compiler only needs to generate one delegate and one compiler-generated method.
From Reflector:
if (CS$<>9__CachedAnonymousMethodDelegate4 == null)
{
    CS$<>9__CachedAnonymousMethodDelegate4 = new Func<ATable, bool>(null, (IntPtr) <Main>b__0);
}
Enumerable.Where<ATable>(tables, CS$<>9__CachedAnonymousMethodDelegate4).ToList<ATable>();
The compiler generated method:
[CompilerGenerated]
private static bool <Main>b__0(ATable m)
{
    return ((m.Prop1 && m.Prop2) && m.Prop3);
}
As you can see, there is only one call into Enumerable.Where<T> with the delegate, as expected, since there was only one Where extension method.
ATable.Where(x=>condition1).Where(x=>condition2).Where(x=>condition3)
Now for this example a lot more code is generated.
if (CS$<>9__CachedAnonymousMethodDelegate5 == null)
{
    CS$<>9__CachedAnonymousMethodDelegate5 = new Func<ATable, bool>(null, (IntPtr) <Main>b__1);
}
if (CS$<>9__CachedAnonymousMethodDelegate6 == null)
{
    CS$<>9__CachedAnonymousMethodDelegate6 = new Func<ATable, bool>(null, (IntPtr) <Main>b__2);
}
if (CS$<>9__CachedAnonymousMethodDelegate7 == null)
{
    CS$<>9__CachedAnonymousMethodDelegate7 = new Func<ATable, bool>(null, (IntPtr) <Main>b__3);
}
Enumerable.Where<ATable>(Enumerable.Where<ATable>(Enumerable.Where<ATable>(tables, CS$<>9__CachedAnonymousMethodDelegate5), CS$<>9__CachedAnonymousMethodDelegate6), CS$<>9__CachedAnonymousMethodDelegate7).ToList<ATable>();
Since we have three chained extension methods, we also get three Func<T, bool> delegates and three compiler-generated methods.
[CompilerGenerated]
private static bool <Main>b__1(ATable m)
{
    return m.Prop1;
}

[CompilerGenerated]
private static bool <Main>b__2(ATable m)
{
    return m.Prop2;
}

[CompilerGenerated]
private static bool <Main>b__3(ATable m)
{
    return m.Prop3;
}
Now this looks like it should be slower since, heck, there is a ton more code. However, since all execution is deferred until GetEnumerator() is called, I doubt any noticeable difference will present itself.
Some gotchas that could affect performance:
Any call to GetEnumerator in the chain will cause the collection to be iterated. ATable.Where().ToList().Where().ToList() will result in an iteration of the collection with the first predicate when the first ToList is called, and then another iteration for the second ToList. Try to postpone the GetEnumerator call to the very last moment to reduce the number of times the collection is iterated.
Linq To Entities
Since we are now using IQueryable<T>, the compiler-generated code is a bit different, as we are using Expression<Func<T, bool>> instead of the normal Func<T, bool>.
Example, all in one:
var allInOneWhere = entityFrameworkEntities.MovieSets.Where(m => m.Name == "The Matrix" && m.Id == 10 && m.GenreType_Value == 3);
This generates one heck of a statement.
IQueryable<MovieSet> allInOneWhere = Queryable.Where<MovieSet>(entityFrameworkEntities.MovieSets, Expression.Lambda<Func<MovieSet, bool>>(Expression.AndAlso(Expression.AndAlso(Expression.Equal(Expression.Property(CS$0$0000 = Expression.Parameter(typeof(MovieSet), "m"), (MethodInfo) methodof(MovieSet.get_Name)), ..tons more stuff...ParameterExpression[] { CS$0$0000 }));
The most notable point is that we end up with one expression tree that is parsed down into Expression.AndAlso pieces. And, as expected, we only have one call to Queryable.Where.
var chainedWhere = entityFrameworkEntities.MovieSets.Where(m => m.Name == "The Matrix").Where(m => m.Id == 10).Where(m => m.GenreType_Value == 3);
I won't even bother pasting in the compiler-generated code for this; it's way too long. But in short, we end up with three calls to Queryable.Where(Queryable.Where(Queryable.Where())) and three expressions. This again is expected, as we have three chained Where clauses.
Generated Sql
Like IEnumerable<T>, IQueryable<T> does not execute until the enumerator is called. Because of this, we can be happy to know that both produce exactly the same SQL statement:
SELECT
    [Extent1].[AtStore_Id] AS [AtStore_Id],
    [Extent1].[GenreType_Value] AS [GenreType_Value],
    [Extent1].[Id] AS [Id],
    [Extent1].[Name] AS [Name]
FROM [dbo].[MovieSet] AS [Extent1]
WHERE (N'The Matrix' = [Extent1].[Name]) AND (10 = [Extent1].[Id]) AND (3 = [Extent1].[GenreType_Value])
Some gotchas that could affect performance:
Any call to GetEnumerator in the chain will cause a call out to SQL; e.g. ATable.Where().ToList().Where() will actually query SQL for all records matching the first predicate and then filter that list with LINQ to Objects using the second predicate.
Since you mention extracting the predicates to use elsewhere, make sure they are in the form of Expression<Func<T, bool>> and not simply Func<T, bool>. The first can be parsed into an expression tree and converted into valid SQL; the second will cause ALL objects to be returned, and the Func<T, bool> will execute on that collection.
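To make the difference concrete, a sketch of the two forms side by side (aTableQueryable and the property name are illustrative):
Expression<Func<ATable, bool>> exprPredicate = x => x.Prop1; // can be translated to SQL
Func<ATable, bool> funcPredicate = x => x.Prop1;             // runs in memory only

var filteredInSql = aTableQueryable.Where(exprPredicate);    // becomes a WHERE clause in the SQL
var filteredInMemory = aTableQueryable.AsEnumerable()
    .Where(funcPredicate);                                   // pulls all rows, then filters locally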
I hope this was a bit helpful to answer your question.

Help converting foreach to Linq

This is a homework question.
I am doing diffs on our domain model, and I've set it up so that I can iterate a list of operations that check for certain differences within the domain. I pass in the differencing function and the before and after states of the object graph to produce a result in the DiffContext, which is used later to set up a payload for calling another service. But I've made some changes and need help with the LINQ syntax.
So, I have the following code ...
public static IEnumerable<DiffContext> GetFirstDifference<T>(IEnumerable<Func<T, T, DiffContext>> diffOperations, T beforeState, T afterState)
{
    return from op in diffOperations
           let diff = op(beforeState, afterState)
           where diff.FoundDifference
           select diff;
}
Which I modified to use Func<T, T, IEnumerable<DiffContext>> instead of the previous Func<T, T, DiffContext>, because now my diff operations can return multiple differences. Like so:
public static IEnumerable<DiffContext> GetFirstDifference<T>(IEnumerable<Func<T, T, IEnumerable<DiffContext>>> diffOperations, T beforeState, T afterState)
{
    foreach (var op in diffOperations)
    {
        foreach (var diff in op(beforeState, afterState))
        {
            yield return diff;
        }
    }
}
But now I have this nested foreach and I'd like some help converting it to the Linq equivalent. Can you help?
Thanks Jon Skeet. I now have the following instead of the nested foreach:
return from op in diffOperations
       from diff in op(beforeState, afterState)
       where diff.FoundDifference
       select diff;
Yup - you want two "from" clauses, basically - that performs a flatten operation. This uses the SelectMany LINQ operator.
Given that this is homework, I'm reluctant to post the full code - but I will say it's a three-line LINQ query (using the natural line breaking). Think about what you want "from" each collection...
Just add comments if that's not enough of a hint.
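For reference, the two "from" clauses compile down to SelectMany; an equivalent method-syntax sketch of the query above:
return diffOperations
    .SelectMany(op => op(beforeState, afterState))
    .Where(diff => diff.FoundDifference);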
