I want to use DocumentDB to store roughly 200,000 documents of the same type. Each document gets an integer id field, and I would like to retrieve them paged, in reverse order (highest id first).
I recently found out that DocumentDB has no sorting (see also DocumentDB - query result order). Perhaps it would be better to go for a different database (such as RavenDB); however, time is pressing and I want to avoid the cost of switching to another database.
The question:
I have been looking at implementing my own sorted index of the documents on the client side (ASP.NET Web API 2). I was thinking of creating a SortedList of key (id) and value (document.selflink). Then I could create a getter with parameters for count, offset and a predicate to filter the documents. Below I added a quick example.
I just have the feeling this is a bad idea: either it is slow, it costs too many resources, or it can be done better another way. So I am open to implementation suggestions.
public class SortableDocumentDbRepository
{
    private SortedList _sorted = new SortedList();
    private readonly string _sortedPropertyName;

    private DocumentCollection ReadOrCreateCollection(string databaseLink) {
        DocumentCollection col = base.ReadOrCreateCollection(databaseLink);
        var docs = Client.CreateDocumentQuery(Collection.DocumentsLink)
            .AsEnumerable();
        lock (_sorted.SyncRoot) {
            foreach (Document doc in docs) {
                var propVal = doc.GetPropertyValue<string>(_sortedPropertyName);
                if (propVal != null) {
                    _sorted.Add(propVal, doc.SelfLink);
                }
            }
        }
        return col;
    }

    public List<T> GetItems<T>(int count, int offset, Expression<Func<T, bool>> predicate) {
        List<T> result = new List<T>();
        lock (_sorted.SyncRoot) {
            var values = _sorted.GetValueList();
            for (int i = offset; i < _sorted.Count; i++) {
                var queryable = predicate != null ?
                    Client.CreateDocumentQuery<T>(values[i].ToString()).Where(predicate) :
                    Client.CreateDocumentQuery<T>(values[i].ToString());
                T item = queryable.AsEnumerable().FirstOrDefault();
                if (item == null || item.Equals(default(T))) continue;
                result.Add(item);
                if (result.Count >= count) return result;
            }
        }
        return result;
    }
}
Microsoft has implemented Sorting:
https://learn.microsoft.com/en-us/azure/cosmos-db/sql-api-sql-query-reference#bk_orderby_clause
Example: SELECT * FROM c ORDER BY c._ts DESC
As you mentioned, order by unfortunately isn't implemented yet.
Your approach looks reasonable to me.
I see you are using a predicate to narrow the query result set (pulling 200,000 records for any DB will be costly).
Since it looks like you want to order by id, you can also look into setting up a range index on id, allowing you to perform range queries (e.g. < and >) on the id and further narrow the query result set. There is also a range index included by default on the _ts (timestamp) system property on documents that may be helpful in this context.
See: http://azure.microsoft.com/en-us/documentation/articles/documentdb-indexing-policies/
I built a new query that selects from the Article class, adding a where clause for each field the user filled in. However, it returns the whole list every time, even when fields are selected!
Here is my code:
ParseQuery<Article> query = new ParseQuery<Article>();
if (souCategorie.SelectedIndex >= 0)
{
    query.WhereEqualTo("idSCategorie", listeSouCategorie.ElementAt(souCategorie.SelectedIndex));
}
if(motcle.Text.Length > 0)
{
    query.WhereContains("nom", motcle.Text);
    // query.WhereContains("description", motcle.Text);
}
if(distance.Text.Length>0)
    if (Convert.ToDouble(distance.Text) > 0)
    {
        Debug.WriteLine(distance.Text);
        ParseGeoPoint geo = new ParseGeoPoint();
        geo.Latitude = geoposition.Coordinate.Latitude;
        geo.Longitude = geoposition.Coordinate.Longitude;
        query.WhereWithinDistance("coordonnees", geo, ParseGeoDistance.FromKilometers(Convert.ToDouble(distance.Text)));
    }
IEnumerable<Article> lst = await query.FindAsync();
rechercheResult.DataContext = lst.ToList();
What could possibly be wrong?
I know that queries can do funky stuff when you start trying to use GeoPoint stuff. I would try setting up two queries, one that just queries for objects within a distance, then pass that query into the second query that has the whereEqualTo and whereContains calls.
I am querying 200k records and using up all the server's memory (no surprise). I am new to LINQ so I found the following code that should help me but I don't know how to use it:
public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> collection, int batchSize)
{
    List<T> nextbatch = new List<T>(batchSize);
    foreach (T item in collection)
    {
        nextbatch.Add(item);
        if (nextbatch.Count == batchSize)
        {
            yield return nextbatch;
            nextbatch = new List<T>(batchSize);
        }
    }
    if (nextbatch.Count > 0)
        yield return nextbatch;
}
Source: http://goo.gl/aQZIj
Here is my code which creates the "out of memory" error. How do I incorporate the new Batch function into my code?
var crmMetrics = _crmDbContext.tpm_metricsSet.Where(a => a.ModifiedOn >= lastRunDate);
foreach (var crmMetric in crmMetrics)
{
    metric = new Metric();
    metric.ProductKey = crmMetric.tpm_Product.Id;
    dbContext.Metrics.Add(metric);
    dbContext.SaveChanges();
}
It's an extension method, so if it is part of a static class and there is a reference to the class's namespace in your code you could do:
var crmMetricsBatches = _crmDbContext.tpm_metricsSet
    .Where(a => a.ModifiedOn >= lastRunDate)
    .AsEnumerable() // !!
    .Batch(20);
Except it wouldn't help. Because of the .AsEnumerable(), you still fetch all the data into memory, only now in chunks of 20. This is because you can't use the method directly against IQueryable: Entity Framework will try to translate it to SQL but of course has no clue how to do that.
As TGH said, Skip and Take are better suited to this:
var crmMetricsPage = _crmDbContext.tpm_metricsSet
    .Where(a => a.ModifiedOn >= lastRunDate)
    .OrderBy(a => a.??) // some property you choose
    .Skip(pageNo * pageSize)
    .Take(pageSize);
where pageNo counts from 0 to the number of pages (- 1) you're going to need. Skip and Take are expressions, and EF knows how to convert these to SQL. The OrderBy is required for EF to know where to start skipping.
In this process, called paging, you always get pageSize records at a time. The number of queries is greater, but resources are spared. One condition is that you can determine a pageSize in advance. I don't know if this fits with your logic.
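As a rough sketch of the paging loop described above: the helper below repeatedly runs OrderBy/Skip/Take against an IQueryable and hands each page to a callback. The names PagingSketch, ForEachPage and processPage are made up for this illustration and are not part of Entity Framework; with a real context you would presumably add the new entities and call SaveChanges inside the callback.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class PagingSketch
{
    // Hypothetical helper: fetches the source one page at a time and
    // passes each page to processPage. Returns the number of pages handled.
    public static int ForEachPage<T, TKey>(
        IQueryable<T> source,
        Expression<Func<T, TKey>> orderKey,
        int pageSize,
        Action<List<T>> processPage)
    {
        if (pageSize <= 0) throw new ArgumentOutOfRangeException(nameof(pageSize));

        int pageNo = 0;
        while (true)
        {
            // A stable ordering is needed so that Skip knows where to start.
            List<T> page = source
                .OrderBy(orderKey)
                .Skip(pageNo * pageSize)
                .Take(pageSize)
                .ToList();

            if (page.Count == 0) break;      // nothing left
            processPage(page);
            pageNo++;
            if (page.Count < pageSize) break; // short page means last page
        }
        return pageNo;
    }
}
```

Only one page is materialised at a time, at the cost of one query per page. Note that this assumes the underlying data does not change between queries; if rows are inserted or deleted while paging, items can be skipped or seen twice.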
If you can't use paging, you should try to narrow the filter (Where(a => a.ModifiedOn >= lastRunDate)), e.g. try to get the data in batches of one day or one week.
I would use Linq's Skip and Take to get the batches
Check this out:
http://www.c-sharpcorner.com/UploadFile/3d39b4/take-and-skip-operator-in-linq-to-sql/
I have an in memory List of objects. I want to check if each one exists in a database and if not, set a bool property on that object to true.
Object
class Part
{
public bool NewPart { get; set; }
public string PartNumber { get; set; }
public string Revision { get; set; }
public string Description { get; set; }
}
The list contains the collection of parts. For each part, if it exists in the database then NewPart should be set to FALSE, else TRUE. I'm looking for the most efficient way to do this, as there are likely to be hundreds of parts, so I'm thinking that running a SQL query for each part may not be the most efficient method.
Ideas on the best way to achieve this appreciated.
It depends on which ORM you are using, but with Linq2Sql you can use a query like:
var query = from p in db.parts
            where myList.Contains(p.PartNumber)
            select p.PartNumber;
You can then use the IEnumerable returned to set your NewPart field.
As an alternative, if your ultimate goal is to do an Upsert type action, then check out this question and its answers Insert Update stored proc on SQL Server (needs SQL level implementation, not linq)
The following will only hit the database once.
var myList = (from r in parts select r.PartNumber).ToList();
var existingParts = (from r in dc.Parts
                     where myList.Contains(r.PartNumber) select r.PartNumber).ToList();
foreach(var r in parts)
    r.NewPart = !existingParts.Contains(r.PartNumber); // new only if NOT found in the database
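Note that the generated SQL could very well be something like
SELECT PartNumber
FROM Parts WHERE PartNumber IN (@p0, @p1, @p2, @p3 .... )
so this should work if the parts list holds a hundred or so items, but not if it holds more than about 2100, which is the limit on the number of parameters SQL Server accepts in a single request.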
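One way to stay under that parameter limit is to split the part numbers into chunks and issue one query per chunk. The following is only a sketch under stated assumptions: NewPartMarker, MarkNewParts and the lookupExisting delegate are hypothetical names, and the delegate stands in for the real database query (for example chunk => dc.Parts.Where(r => chunk.Contains(r.PartNumber)).Select(r => r.PartNumber)).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Part
{
    public bool NewPart { get; set; }
    public string PartNumber { get; set; }
}

public static class NewPartMarker
{
    // lookupExisting is a stand-in for the database query: given a chunk of
    // part numbers, it returns those that already exist in the database.
    public static void MarkNewParts(
        IList<Part> parts,
        Func<List<string>, IEnumerable<string>> lookupExisting,
        int chunkSize = 2000)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));

        var numbers = parts.Select(p => p.PartNumber).Distinct().ToList();
        var existing = new HashSet<string>();

        // One query per chunk keeps each IN (...) list below the limit.
        for (int i = 0; i < numbers.Count; i += chunkSize)
        {
            List<string> chunk = numbers.Skip(i).Take(chunkSize).ToList();
            existing.UnionWith(lookupExisting(chunk));
        }

        // A part is new exactly when its number was not found.
        foreach (Part p in parts)
            p.NewPart = !existing.Contains(p.PartNumber);
    }
}
```

For a few hundred parts this still results in a single query; the chunking only matters once the list grows past the chunk size. Collecting the results in a HashSet also keeps the final lookup per part cheap.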
This is one of those cases where the most efficient approach depends upon the actual data.
The first obtains all partNums from the database:
HashSet<string> partNums = new HashSet<string>(from p in GetTable<DBPart>() select p.PartNumber);
foreach(var p in parts)
    p.NewPart = !partNums.Contains(p.PartNumber);
The second queries the database with the relevant partNumbers:
HashSet<string> partNums = new HashSet<string>(
    from p in GetTable<DBPart>() where (from mp in parts select mp.PartNumber).Contains(p.PartNumber) select p.PartNumber);
foreach(var p in parts)
    p.NewPart = !partNums.Contains(p.PartNumber);
The former will be more efficient below a certain number of rows in the database, and less efficient above it, because the latter takes longer to build a more complicated query, but the former returns everything.
Another factor is the percentage of hits expected. If this number is relatively low (i.e. only a small number of the parts in the list will be in the database) then it could be more efficient to do:
// Assumes NewPart starts out as true for every part; only the hits are flipped.
Dictionary<string, Part> dict = partsSource.ToDictionary(p => p.PartNumber);
foreach(string pn in
    from p in GetTable<DBPart>() where (from kv in dict select kv.Key).Contains(p.PartNumber) select p.PartNumber)
    dict[pn].NewPart = false;
Here partsSource is the means by which the list of parts was obtained in the first place. Instead of obtaining a list, we obtain a dictionary, which makes retrieving the parts we want more efficient. However, if we're going to obtain the parts as a list anyway, then we can't really gain here, as building the dictionary in the first place takes slightly more effort than iterating through the list.
What I'd like to be able to do is construct a LINQ query that retrieves a few values from some DataRows when one of the fields changes. Here's a contrived example to illustrate:
Observation    Temp  Time
-------------  ----  ------
Cloudy         15.0  3:00PM
Cloudy         16.5  4:00PM
Sunny          19.0  3:30PM
Sunny          19.5  3:15PM
Sunny          18.5  3:30PM
Partly Cloudy  16.5  3:20PM
Partly Cloudy  16.0  3:25PM
Cloudy         16.0  4:00PM
Sunny          17.5  3:45PM
I'd like to retrieve only the entries when the Observation changed from the previous one. So the results would include:
Cloudy         15.0  3:00PM
Sunny          19.0  3:30PM
Partly Cloudy  16.5  3:20PM
Cloudy         16.0  4:00PM
Sunny          17.5  3:45PM
Currently there is code that iterates through the DataRows and does the comparisons and construction of the results but was hoping to use LINQ to accomplish this.
What I'd like to do is something like this:
var weatherStuff = from row in ds.Tables[0].AsEnumerable()
where row.Field<string>("Observation") != weatherStuff.ElementAt(weatherStuff.Count() - 1) )
select row;
But that doesn't work - and doesn't compile since this tries to use the variable 'weatherStuff' before it is declared.
Can what I want to do be done with LINQ? I didn't see another question like it here on SO, but could have missed it.
Here is one more general thought that may be interesting. It's more complicated than what @tvanfosson posted, but in a way, it's more elegant I think :-). The operation you want to do is to group your observations using the first field, but you want to start a new group each time the value changes. Then you want to select the first element of each group.
This sounds almost like LINQ's group by but it is a bit different, so you can't really use the standard group by. However, you can write your own version (that's the wonder of LINQ!). You can either write your own extension method (e.g. GroupByMoving) or you can write an extension method that changes the type from IEnumerable to an interface of your own and then define GroupBy for this interface. The resulting query will look like this:
var weatherStuff =
    from row in ds.Tables[0].AsEnumerable().AsMoving()
    group row by row.Field<string>("Observation") into g
    select g.First();
The only thing that remains is to define AsMoving and implement GroupBy. This is a bit of work, but it is a generally useful thing that can be used to solve other problems too, so it may be worth doing :-). The summary of my post is that the great thing about LINQ is that you can customize how the operators behave to get quite elegant code.
I haven't tested it, but the implementation should look like this:
// Requires: using System; using System.Collections; using System.Collections.Generic;

// Interface & simple implementation so that we can change GroupBy
interface IMoving<T> : IEnumerable<T> { }

class WrappedMoving<T> : IMoving<T> {
    public IEnumerable<T> Wrapped { get; set; }
    public IEnumerator<T> GetEnumerator() {
        return Wrapped.GetEnumerator();
    }
    // The non-generic enumerator has to be implemented explicitly
    IEnumerator IEnumerable.GetEnumerator() {
        return ((IEnumerable)Wrapped).GetEnumerator();
    }
}

// Important bits:
static class MovingExtensions {
    public static IMoving<T> AsMoving<T>(this IEnumerable<T> e) {
        return new WrappedMoving<T> { Wrapped = e };
    }

    // This is (an ugly & imperative) implementation of the
    // group by as described earlier (you can probably implement it
    // more nicely using other LINQ methods). It takes IMoving<T> so that
    // it is picked only for sequences wrapped with AsMoving.
    public static IEnumerable<IEnumerable<T>> GroupBy<T, K>(this IMoving<T> source,
            Func<T, K> keySelector) {
        List<T> elementsSoFar = new List<T>();
        IEnumerator<T> en = source.GetEnumerator();
        if (en.MoveNext()) {
            K lastKey = keySelector(en.Current);
            do {
                K newKey = keySelector(en.Current);
                // != is not defined for an unconstrained generic type
                if (!EqualityComparer<K>.Default.Equals(newKey, lastKey)) {
                    yield return elementsSoFar;
                    elementsSoFar = new List<T>();
                    lastKey = newKey; // remember the key of the group just started
                }
                elementsSoFar.Add(en.Current);
            } while (en.MoveNext());
            yield return elementsSoFar;
        }
    }
}
You could use the IEnumerable extension that takes an index.
var all = ds.Tables[0].AsEnumerable();
var weatherStuff = all.Where( (w,i) => i == 0 || w.Field<string>("Observation") != all.ElementAt(i-1).Field<string>("Observation") );
This is one of those instances where the iterative solution is actually better than the set-based solution in terms of both readability and performance. All you really want Linq to do is filter and pre-sort the list if necessary to prepare it for the loop.
It is possible to write a query in SQL Server (or various other databases) using windowing functions (ROW_NUMBER), if that's where your data is coming from, but very difficult to do in pure Linq without making a much bigger mess.
If you're just trying to clean the code up, an extension method might help:
// Must be declared inside a static class.
public static IEnumerable<T> Changed<T>(this IEnumerable<T> items,
    Func<T, T, bool> equalityFunc)
{
    if (equalityFunc == null)
    {
        throw new ArgumentNullException("equalityFunc");
    }
    T last = default(T);
    bool first = true;
    foreach (T current in items)
    {
        if (first || !equalityFunc(current, last))
        {
            yield return current;
        }
        last = current;
        first = false;
    }
}
Then you can call this with:
var changed = rows.Changed((r1, r2) =>
r1.Field<string>("Observation") == r2.Field<string>("Observation"));
I think what you are trying to accomplish is not possible using the "syntactic sugar" of query expressions. However, it could be possible using the overload of the Select extension method that passes the index of the item you are evaluating. That way you could use the index to compare the current item with the previous one (index - 1).
You could use MoreLINQ's GroupAdjacent() extension method
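To make the index-based idea concrete, here is a minimal sketch that works on any in-memory list rather than on DataRows. It uses the index-taking overload of Where (the filtering counterpart of the indexed Select); ChangeDetection and FirstOfEachRun are names invented for this example. It takes an IReadOnlyList so that looking up the previous item by index is cheap.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ChangeDetection
{
    // Keeps the first item of every run of consecutive items sharing a key.
    public static List<T> FirstOfEachRun<T, TKey>(
        IReadOnlyList<T> items,
        Func<T, TKey> key)
    {
        var comparer = EqualityComparer<TKey>.Default;
        return items
            // i == 0 keeps the very first item; afterwards an item is kept
            // only if its key differs from the key of the item before it.
            .Where((item, i) => i == 0 || !comparer.Equals(key(item), key(items[i - 1])))
            .ToList();
    }
}
```

With the sample data from the question, calling FirstOfEachRun with a selector that returns the Observation value would keep the rows with temperatures 15.0, 19.0, 16.5, 16.0 and 17.5, in that order.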
GroupAdjacent: Groups the adjacent elements of a sequence according to
a specified key selector function...This method has 4 overloads.
You would use it like this with the result selector overload to lose the IGrouping key:-
var weatherStuff = ds.Tables[0].AsEnumerable().GroupAdjacent(w => w.Field<string>("Observation"), (_, val) => val.Select(v => v));
This is a very popular extension to the default LINQ methods, with more than 1M downloads on NuGet (compared to MS's own Ix.net with ~40k downloads at the time of writing).
Let's say I have an array, and I want to do a LINQ query against a varchar that returns any records that have an element of the array anywhere in the varchar.
Something like this would be sweet.
string[] industries = { "airline", "railroad" }
var query = from c in contacts where c.industry.LikeAnyElement(industries) select c
Any ideas?
This is actually an example I use in my "Express Yourself" presentation, for something that is hard to do in regular LINQ; As far as I know, the easiest way to do this is by writing the predicate manually. I use the example below (note it would work equally for StartsWith etc):
using (var ctx = new NorthwindDataContext())
{
    ctx.Log = Console.Out;
    var data = ctx.Customers.WhereTrueForAny(
        s => cust => cust.CompanyName.Contains(s),
        "a", "de", "s").ToArray();
}
// ...
public static class QueryableExt
{
    public static IQueryable<TSource> WhereTrueForAny<TSource, TValue>(
        this IQueryable<TSource> source,
        Func<TValue, Expression<Func<TSource, bool>>> selector,
        params TValue[] values)
    {
        return source.Where(BuildTrueForAny(selector, values));
    }
    public static Expression<Func<TSource, bool>> BuildTrueForAny<TSource, TValue>(
        Func<TValue, Expression<Func<TSource, bool>>> selector,
        params TValue[] values)
    {
        if (selector == null) throw new ArgumentNullException("selector");
        if (values == null) throw new ArgumentNullException("values");
        if (values.Length == 0) return x => true;
        if (values.Length == 1) return selector(values[0]);
        var param = Expression.Parameter(typeof(TSource), "x");
        Expression body = Expression.Invoke(selector(values[0]), param);
        for (int i = 1; i < values.Length; i++)
        {
            body = Expression.OrElse(body,
                Expression.Invoke(selector(values[i]), param));
        }
        return Expression.Lambda<Func<TSource, bool>>(body, param);
    }
}
from c in contacts
where industries.Any(i => i == c.industry)
select c;
Something like that. Use the Any method on the collection.
IEnumerable.Contains() translates to SQL IN as in:
WHERE 'american airlines' IN ('airline', 'railroad') -- FALSE
String.Contains() translates to SQL LIKE %...% as in:
WHERE 'american airlines' LIKE '%airline%' -- TRUE
If you want the contacts where the contact's industry is LIKE (contains) any of the given industries, you want to combine both Any() and String.Contains() into something like this:
string[] industries = { "airline", "railroad" };
var query = from c in contacts
where industries.Any(i => c.Industry.Contains(i))
select c;
However, combining both Any() and String.Contains() like this is NOT supported in LINQ to SQL. If the set of given industries is small, you can try something like:
where c.Industry.Contains("airline") ||
c.Industry.Contains("railroad") || ...
Or (although normally not recommended) if the set of contacts is small enough, you could bring them all from the DB and apply the filter with LINQ to Objects by using contacts.AsEnumerable() or contacts.ToList() as the source of the query above:
var query = from c in contacts.AsEnumerable()
where industries.Any(i => c.Industry.Contains(i))
select c;
It will work if you build up the query as follows:
var query = from c in contacts.AsEnumerable()
select c;
query = query.Where(c=> (c.Industry.Contains("airline")) || (c.Industry.Contains("railroad")));
You just need to generate the predicate above programmatically if the parameters airline and railroad are user inputs. This was in fact a little more complicated than I was expecting. See the article at http://www.albahari.com/nutshell/predicatebuilder.aspx
Unfortunately, LIKE is not supported in LINQ to SQL as per here:
http://msdn.microsoft.com/en-us/library/bb882677.aspx
To get around this, you will have to write a stored procedure which will accept the parameters you want to use in the like statement(s) and then call that from LINQ to SQL.
It should be noted that a few of the answers suggest using Contains. This won't work because it looks to see that the entire string matches the array element. What is being looked for is for the array element to be contained in the field itself, something like:
industry LIKE '%<element>%'
As Clark has mentioned in a comment, you could use a call to IndexOf on each element (which should translate to a SQL call):
string[] industries = { "airline", "railroad" };
var query =
    from c in contacts
    where
        c.industry.IndexOf(industries[0]) != -1 ||
        c.industry.IndexOf(industries[1]) != -1
    select c;
If you know the length of the array and the number of elements, then you could hard-code this. If you don't, then you will have to create the Expression instance based on the array and the field you are looking at.
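For the case where the array length is not known up front, the expression can be built in a loop. The sketch below is one possible way to do it, not a definitive implementation: the Contact class, ContainsAnyBuilder and IndustryContainsAny are hypothetical names, and it chains String.Contains calls with OrElse rather than IndexOf. It is exercised here against an in-memory list via AsQueryable; whether a given LINQ provider translates the resulting tree is something you would need to verify for your setup.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;
using System.Reflection;

public class Contact
{
    public string Name { get; set; }
    public string Industry { get; set; }
}

public static class ContainsAnyBuilder
{
    // Builds: c => c.Industry.Contains(t0) || c.Industry.Contains(t1) || ...
    public static Expression<Func<Contact, bool>> IndustryContainsAny(
        IEnumerable<string> terms)
    {
        ParameterExpression c = Expression.Parameter(typeof(Contact), "c");
        MemberExpression industry = Expression.Property(c, nameof(Contact.Industry));
        MethodInfo contains = typeof(string).GetMethod("Contains", new[] { typeof(string) });

        Expression body = null;
        foreach (string term in terms)
        {
            Expression call = Expression.Call(industry, contains, Expression.Constant(term));
            body = body == null ? call : Expression.OrElse(body, call);
        }

        // With no terms at all, this sketch chooses to match nothing.
        return Expression.Lambda<Func<Contact, bool>>(
            body ?? Expression.Constant(false), c);
    }
}
```

Usage would look like contacts.Where(ContainsAnyBuilder.IndustryContainsAny(industries)). Because the predicate is a single expression tree, it can be passed to Where on an IQueryable just like a hand-written lambda.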