SharePoint Library Query running very slow - performance

We have a SharePoint query running on a library of maybe 500 documents to pull back the highest version of the published documents that are flagged as active (have active=true in the Active column).
This is taking way too long to run (about 3-5 seconds), which is frustrating the users.
What can be done to the query below to speed it up? (I would hope for this to be virtually instantaneous!)
using (var site = new SPSite(Helpers.ConfigurationFile.SharePointUrl().ToString()))
{
    using (var web = site.RootWeb)
    {
        return
            web.Folders["Templates"].Files.OfType<SPFile>()
                .Where(file => file.Item.HasPublishedVersion)
                .Where(file => file.Item.Properties["Active"].ToString() == "true")
                .Where(file => file.Item.Versions.OfType<SPListItemVersion>()
                    .Any(x => x.Level == SPFileLevel.Published))
                .Select(file => new LibraryItem(
                    file.Item.UniqueId,
                    file.Item.ID,
                    file.Item.Title,
                    Helpers.ConfigurationFile.SharePointUrl()
                        .ToString().AddPathSegment(file.Url),
                    true,
                    float.Parse(file.MajorVersion.ToString()
                        + "." + file.MinorVersion.ToString())
                ))
                .ToList();
    }
}

A few points:
Aren't the HasPublishedVersion and file.Item.Versions checks doing the same thing - checking whether a published version of the SPFile exists?
The Any query on the SPListItemVersions collection searches in order from Versions[0] to Versions[count - 1], which translates to 'newest to oldest' (Versions[0] is the latest version in SharePoint). If your documents were published more frequently soon after creation than they have been recently, it may be faster to perform the same check by looping backwards through the collection (i.e. from oldest version to newest).
You could index on the Active column.
You can optimise it by performing part of the query using CAML (it's horrible, I know). I would try to delegate the SPFile fetching and Active checking to a CAML query, then do the rest of it using the object model (a rough sketch follows these points). See this link
If you do use CAML, PortalSiteMapProvider.GetCachedListItemsByQuery() might prove to be faster than SPList.GetItems().
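As a rough sketch of the CAML idea (not tested against your library; the "Templates" list name and the type of the Active field are assumptions taken from the question, so adjust them to your schema):

// Sketch only: let SharePoint filter on Active server-side instead of
// enumerating every SPFile and checking the property in memory.
var list = web.Lists["Templates"];

var query = new SPQuery
{
    Query = @"<Where>
                <Eq>
                  <FieldRef Name='Active' />
                  <Value Type='Boolean'>1</Value>
                </Eq>
              </Where>",
    ViewFields = "<FieldRef Name='Title' /><FieldRef Name='FileRef' />",
    ViewAttributes = "Scope='FilesOnly'"
};

SPListItemCollection items = list.GetItems(query);
// Then apply the published-version check and the LibraryItem projection
// to this much smaller result set using the object model, as before.

If you go down the caching route, the same SPQuery can be reused with PortalSiteMapProvider.GetCachedListItemsByQuery instead of SPList.GetItems.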

Related

Uniqueness check in Elasticsearch without constantly refreshing index

I'm indexing a lot of data in Elasticsearch (through NEST) from multiple processes each running multiple threads. Part of indexing a document is finding out if we have seen a similar document before. This feature is implemented by generating a hash of a set of fields on the document and checking if we have documents in Elasticsearch with the same hash. Before indexing a document, I make the following query:
var result = elasticClient
    .Index(indexName)
    .Count<MyDocument>(c => c
        .Query(q => q
            .ConstantScore(qs => qs
                .Filter(f => f
                    .Term(field => field.Hash, hash))))
    ...
This returns a count of existing documents with the specified hash. So far so good; things are working. But if a process indexes two documents with the same hash within the same second, the count check doesn't work, since the first document isn't available for search yet. I'm running with the default refresh interval (1 second). For now I have added a refresh call after indexing each document:
var refreshResponse = client.Refresh(indexName);
This also works but it doesn't scale when indexing large amounts of documents (indexing becomes slow as already pointed out here: https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html).
Any ideas for how to avoid having to call Refresh but still be able to perform a uniqueness check? I'm thinking some sort of local cache shared between all threads with hashes of documents indexed since the last refresh. I know that this won't work across processes, but that is acceptable for now.
I ended up implementing a write-through cache as suggested by Val. This makes it possible to remove the call to Refresh but still make the count on each iteration. This is implemented using a singleton MemoryCache shared between all threads:
var cache = new MemoryCache("hashes");
When checking for uniqueness I check the cache in case no similar documents are found in Elasticsearch:
var result = elasticClient
    .Count<MyDocument>(c => c
        .Index(indexName)
        .Query(q => q
            .ConstantScore(qs => qs
                .Filter(f => f
                    .Term(field => field.Hash, hash)))));

bool isUnique = false;
if (result.Count == 0)
{
    isUnique = !cache.Contains(hash);
}
If the count for the hash returns 0, I check the cache for that hash.
When a document has been successfully indexed, I store the hash in the cache with an expiration:
var policy = new CacheItemPolicy();
policy.AbsoluteExpiration = DateTimeOffset.UtcNow.AddSeconds(5);
cache.AddOrGetExisting(hash, string.Empty, policy);
TTL could probably be 1 second as well since that is the refresh interval I currently have configured on the index.
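For completeness, a small sketch of how the two halves fit together, using the same NEST client, index name and singleton cache as above (IsUnique and MarkIndexed are illustrative helper names, not part of the original code):

// Sketch only: shared, process-wide cache of recently indexed hashes.
private static readonly MemoryCache Cache = new MemoryCache("hashes");

private bool IsUnique(string hash)
{
    var result = elasticClient
        .Count<MyDocument>(c => c
            .Index(indexName)
            .Query(q => q
                .ConstantScore(qs => qs
                    .Filter(f => f
                        .Term(field => field.Hash, hash)))));

    // Only fall back to the cache when Elasticsearch reports no match,
    // i.e. a duplicate may have been indexed within the refresh interval.
    return result.Count == 0 && !Cache.Contains(hash);
}

private void MarkIndexed(string hash)
{
    // Keep the hash slightly longer than the index refresh interval.
    var policy = new CacheItemPolicy
    {
        AbsoluteExpiration = DateTimeOffset.UtcNow.AddSeconds(5)
    };
    Cache.AddOrGetExisting(hash, string.Empty, policy);
}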

System.Data.EntityProxies group by keeps doing tons of selects GroupBy linq

I have inherited a project that uses dynamic proxies with EF6. From a repo it returns IQueryable<proxyObject>.
I can watch SQL Profiler and see it returns 6000+ records. Great! So far so good.
At this point I then create 3 lists against that dataset (say 3 x 2000 records).
Because each of those has filter logic, I can also see a call to the DB to return each list. Great! So far 4 calls to the DB and 6000 records.
THE PROBLEM
Every time I run this group by, I get 2000 calls to the DB! One call for each record in Sublist. My guess is that it needs to inflate the proxy object each time? It's terribly slow, however.
var lts = Sublist.GroupBy(p => p.proxyObject.ProvinceCode)
    .Select(n => new CountModel()
    {
        TypeName = n.Key,
        ItemCount = n.Count()
    }).ToList();

PresentationModel.AddRange(lts);
I ended up resolving this issue by selecting into a new POCO directly. What was happening was that, because the entity was a combination of a bunch of underlying repos, EF was creating a ton of individual queries and then rolling them up.
var last3Months =
    ProxyEntity.Where(l => DateTime.Now <= l.EffectiveDate)
        .Select(l => new ModelMicro()
        {
            x = l.x,
            y = l.y,
            z = l.z
        });
This resulted in a single call to the DB returning 6000 rows that I could then shape however I needed. I then grouped on z in my underlying return (see the sketch below). There are most likely ways to make it better.
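For reference, a minimal sketch of the in-memory grouping over the projected POCOs (ModelMicro, CountModel and PresentationModel are the names from the snippets above; the key z stands in for whatever field you actually group on):

// Sketch only: materialize the projection once (a single SELECT),
// then group in memory with LINQ to Objects, so no per-row queries occur.
var materialized = last3Months.ToList();

var lts = materialized
    .GroupBy(m => m.z)
    .Select(g => new CountModel
    {
        TypeName = g.Key,
        ItemCount = g.Count()
    })
    .ToList();

PresentationModel.AddRange(lts);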

Select Count very slow using EF with Oracle

I'm using EF 5 with Oracle database.
I'm doing a select count in a table with a specific parameter. When I'm using EF, the query returns the value 31, as expected, but the result takes about 10 seconds to be returned.
using (var serv = new Aperam.SIP.PXP.Negocio.Modelos.SIP_PA())
{
    var teste = (from ens in serv.PA_ENSAIOS_UM
                 where ens.COD_IDENT_UNMET == "FBLDY3840"
                 select ens).Count();
}
If I execute the simple query below, the result is the same (31), but it is returned in about 500 milliseconds.
SELECT
    count(*)
FROM
    PA_ENSAIOS_UM
WHERE
    COD_IDENT_UNMET = 'FBLDY3840'
Is there a way to improve the performance when I'm using EF?
Note: there are about 13,000,000 rows in this table.
Here are some things you can try:
Capture the query that is being generated and see if it is the same as the one you are using. Details can be found here, but essentially, you will instantiate your DbContext (let's call it "_context") and then set the Database.Log property to be the logging method. It's fine if this method doesn't actually do anything--you can just set a breakpoint in there and see what's going on.
So, as an example: define a logging function (I have a static class called "Logging" which uses nLog to write to files)
public static void LogQuery(string queryData)
{
    if (string.IsNullOrWhiteSpace(queryData))
        return;

    var message = string.Format("{0}{1}",
        queryData.Trim().Contains(Environment.NewLine) ?
            Environment.NewLine : "", queryData);

    _sqlLogger.Info(message);
    _genLogger.Trace($"EntityFW query (len {message.Length} chars)");
}
Then when you create your context point to LogQuery:
_context.Database.Log = Logging.LogQuery;
When you do your tests, remember that often the first run is the slowest because the server has to actually do the work, but on the subsequent runs, it often uses cached data. Try running your tests 2-3 times back to back and see if they don't start to run in the same time.
I don't know if it generates the same query or not, but try this other form (which should be functionally equivalent, but may provide better time)
var teste = serv.PA_ENSAIOS_UM.Count(ens=>ens.COD_IDENT_UNMET == "FBLDY3840");
I'm wondering if the version you have pulls data from the DB and THEN counts it. If so, this other syntax may leave all the work to be done at the server, where it belongs. Not sure, though, especially since I haven't ever used EF with Oracle and I don't know if it behaves the same as SQL Server or not.

Slow query over large collection

I'm working on an audit log which saves sessions in RavenDB. Initially, the website for querying the audit logs was responsive enough, but as the amount of logged data increased, the search page became unusable (it times out before returning with default settings, regardless of the query used). Right now we have about 45 million sessions in the collection that gets queried, but steady state is expected to be around 150 million documents.
The problem is that with this much live data, playing around to test things has become impractical. I hope someone can give me some ideas about the most productive areas to investigate.
The index looks like this:
public AuditSessions_WithSearchParameters()
{
    Map = sessions => from session in sessions
                      select new Result
                      {
                          ApplicationName = session.ApplicationName,
                          SessionId = session.SessionId,
                          StartedUtc = session.StartedUtc,
                          User_Cpr = session.User.Cpr,
                          User_CprPersonId = session.User.CprPersonId,
                          User_ApplicationUserId = session.User.ApplicationUserId
                      };

    Store(r => r.ApplicationName, FieldStorage.Yes);
    Store(r => r.StartedUtc, FieldStorage.Yes);
    Store(r => r.User_Cpr, FieldStorage.Yes);
    Store(r => r.User_CprPersonId, FieldStorage.Yes);
    Store(r => r.User_ApplicationUserId, FieldStorage.Yes);
}
The essence of the query is this bit:
// Query input parameters
var fromDateUtc = fromDate.ToUniversalTime();
var toDateUtc = toDate.ToUniversalTime();

sessionQuery = sessionQuery
    .Where(s =>
        s.ApplicationName == applicationName &&
        s.StartedUtc >= fromDateUtc &&
        s.StartedUtc <= toDateUtc
    );

var totalItems = Count(sessionQuery);

var sessionData =
    sessionQuery
        .OrderByDescending(s => s.StartedUtc)
        .Skip((page - 1) * PageSize)
        .Take(PageSize)
        .ProjectFromIndexFieldsInto<AuditSessions_WithSearchParameters.ResultWithAuditSession>()
        .Select(s => new
        {
            s.SessionId,
            s.SessionGroupId,
            s.ApplicationName,
            s.StartedUtc,
            s.Type,
            s.ResourceUri,
            s.User,
            s.ImpersonatingUser
        })
        .ToList();
First, to determine the number of pages of results, I count the number of results in my query using this method:
private static int Count<T>(IRavenQueryable<T> results)
{
    RavenQueryStatistics stats;
    results.Statistics(out stats).Take(0).ToArray();
    return stats.TotalResults;
}
This turns out to be very expensive in itself, so optimizations are relevant both here and in the rest of the query.
The query time is not related to the amount of result items in any relevant way. If I use a different value for the applicationName parameter than any of the results, it is just as slow.
One area of improvement could be to use sequential IDs for the sessions. For reasons not relevant to this post, I found it most practical to use guid based ids. I'm not sure if I can easily change IDs of the existing values (with this much data) and I would prefer not to drop the data (but might if the expected impact is large enough). I understand that sequential ids result in better behaving b-trees for the indexes, but I have no idea how significant the impact is.
Another approach could be to include a timestamp in the id and query for documents with ids starting with the string matching enough of the time to filter the result. An example id could be AuditSessions/2017-12-31-24-31-42/bc835d6c-2fba-4591-af92-7aab96339d84. This also requires me to update or drop all the existing data. This of course also has the benefits of mostly sequential ids.
A third approach could be to move old data into a different collection over time, in recognition of the fact that you would most often look at the most recent data. This requires a background job and support for querying across collection time boundaries. It also has the issue that the collection with the old sessions is still slow if you need to access it.
I'm hoping there is something simpler than these solutions, such as modifying the query or the indexed fields in a way that avoids a lot of work.
At a glance, it is probably related to the range query on the StartedUtc.
I'm assuming that you are using exact numbers, so you have a LOT of distinct values there.
If you can, you can dramatically reduce the cost by changing the index to second or minute granularity (which is usually what you are querying on) and then using Ticks, which allows a numeric range query.
StartedUtcTicks = new DateTime(session.StartedUtc.Year, session.StartedUtc.Month, session.StartedUtc.Day, session.StartedUtc.Hour, session.StartedUtc.Minute, session.StartedUtc.Second).Ticks,
And then query by the date ticks.
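A rough sketch of what that might look like, assuming second-level granularity and a new long field StartedUtcTicks added to the index's Result class (names are taken from the snippets above; details may differ in your RavenDB version):

// Sketch only: index a ticks value truncated to the second alongside the other fields.
Map = sessions => from session in sessions
                  select new Result
                  {
                      ApplicationName = session.ApplicationName,
                      SessionId = session.SessionId,
                      StartedUtcTicks = new DateTime(
                          session.StartedUtc.Year, session.StartedUtc.Month,
                          session.StartedUtc.Day, session.StartedUtc.Hour,
                          session.StartedUtc.Minute, session.StartedUtc.Second).Ticks,
                      // ...remaining User_* fields as before
                  };

// And query on the numeric range instead of the DateTime values:
var fromTicks = fromDateUtc.Ticks;
var toTicks = toDateUtc.Ticks;

sessionQuery = sessionQuery
    .Where(s =>
        s.ApplicationName == applicationName &&
        s.StartedUtcTicks >= fromTicks &&
        s.StartedUtcTicks <= toTicks
    );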

Chaining to a compiled query loses performance benefit

I started using compiled queries to increase the performance of some commonly executed LINQ to Entities queries. In one scenario I only boiled the query down to its most basic form and pre-compiled that, then I tack on additional where clauses based on user input.
I seem to be losing the performance benefit of compiled queries in this particular case. Can someone explain why?
Here's an example of what I'm doing...
IEnumerable<Task> tasks = compiledQuery.Invoke(context, userId);

if (status != null)
{
    tasks = tasks.Where(x => x.Status == status);
}

if (category != null)
{
    tasks = tasks.Where(x => x.Category == category);
}

return tasks;
I think it's important to understand how Compiled Queries in EF work.
When you execute a query Entity Framework will map your expression tree with the help of your mapping file (EDMX or with code first your model definitions) to a SQL query. This can be a complex and performance intensive task.
Precompiling stores the results of these mapping phase so the next time you hit the query it has the SQL already available and it only has to set the current parameters.
The problem is that a precompiled query will lose its performance benefit as soon as you modify the query. Let's say you have the following:
IQueryable<Task> query = GetCompiledQuery(); // => db.Tasks.Where(t => t.Id == myId);
var notModifiedResult = query.ToList();      // Fast
int modifiedResult = query.Count();          // Slow
With the first query you will have all the benefits of precompiling, because EF has the SQL already generated for you and can execute it immediately.
The second query will lose the precompiling because it has to regenerate its SQL.
If you would now execute a query on notModifiedResult this will be a Linq To Objects one because you have already executed your SQL to the database and fetched all the elements in memory.
You can however chain Compiled Queries (that is, use a compiled query in another compiled query).
But your code would require a series of compiled queries (a rough sketch of such a set follows this list):
- The default
- One where status != null
- One where category != null
- One where both status and category != null
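A minimal sketch of what that set of precompiled variants could look like, assuming an ObjectContext-based model called MyContext with a Tasks set and a UserId filter field (CompiledQuery works against ObjectContext, and all names here are illustrative, not from the question):

// Sketch only: one precompiled delegate per filter combination; pick the
// matching one at runtime based on which filters the caller supplied.
static readonly Func<MyContext, int, IQueryable<Task>> ByUser =
    CompiledQuery.Compile<MyContext, int, IQueryable<Task>>(
        (ctx, userId) => ctx.Tasks.Where(t => t.UserId == userId));

static readonly Func<MyContext, int, string, IQueryable<Task>> ByUserAndStatus =
    CompiledQuery.Compile<MyContext, int, string, IQueryable<Task>>(
        (ctx, userId, status) =>
            ctx.Tasks.Where(t => t.UserId == userId && t.Status == status));

// ...plus ByUserAndCategory and ByUserStatusAndCategory defined the same way.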
(Note: I haven't done any EF work for ages, and then it was just pottering. This is just an informed guess, really.)
This could be the culprit:
IEnumerable<Task> tasks = compiledQuery.Invoke(context, userId);
Any further querying will have to be done within the .NET process, not in SQL. All the possible results will have to be fetched from the database and filtered locally. Try this instead:
IQueryable<Task> tasks = compiledQuery.Invoke(context, userId);
(Assuming that's valid, of course.)
The compiled query can't be changed, only the parameters can be changed. What you are doing here is actually running the query, and THEN filtering the results.
.Invoke(context, userId); // returns all the results
.Where(....) // filters on that entire collection
You can see if there is a clever way to restate your query, so that the parameters are included in all cases but have no effect when not needed. I haven't worked with compiled queries, sorry about that, but does this work (using -1 as the "ignore" value)?
// bunch of code to define the compiled query part, copied from the MSDN example
(ctx, total) => from order in ctx.SalesOrderHeaders
                where total == -1 || order.TotalDue >= total
                select order
In SQL, you do this either by using dynamic SQL, or by having a default value (or null) that you pass in to indicate that the parameter should be ignored:
select * from table t
where
    (@age = 0 or t.age = @age) and
    (@weight is null or t.weight = @weight)
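Applied to the question's Tasks query, a single compiled query with "ignorable" parameters might look roughly like this (MyContext, the field names and the use of null as the sentinel are assumptions, not the asker's actual code):

// Sketch only: one compiled query where null means "don't filter on this".
static readonly Func<MyContext, int, string, string, IQueryable<Task>> TasksQuery =
    CompiledQuery.Compile<MyContext, int, string, string, IQueryable<Task>>(
        (ctx, userId, status, category) =>
            from t in ctx.Tasks
            where t.UserId == userId
               && (status == null || t.Status == status)
               && (category == null || t.Category == category)
            select t);

// Usage: pass null for any filter you want ignored.
var tasks = TasksQuery.Invoke(context, userId, status, category);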
