I'm experiencing performance issues with some of my queries against an EF5 DbContext model in my ASP.NET MVC application.
The queries have many includes over multiple levels of the navigation graph, e.g.:
Context.Cars
.Include(c=>c.Model.Maker)
.Include(c=>c.CarOwners.Select(co=>co.Owner))
.Include(c=>c.Navigation1)
.Include(c=>c.Navigation2)
.Include(c=>c.Navigation3)
.ToList();
The first time I run a query it takes about 10 seconds to execute, but when I refresh the page it takes less than a second.
I ran Visual Studio's Performance Analysis tool to see where the problem is, and it seems that the GetExecutionPlan() method is consuming most of the time.
I guess the plan is being cached, since the second time the query is run (on a page refresh) it executes really fast (less than a second).
I understand that the performance of the first page load is limited since the query is really complicated (the SQL sent to the DB is about 4k lines long). But the problem is that if I return to the page in an hour or so the query is slow again. It seems like the execution plan cache is cleared somehow? I've checked the IIS settings and all application pool recycling settings are turned off.
Just to be clear, I'm not looking for methods to optimize my queries, I'm wondering why my query behaves strangely: first load slow, second load fast and load after one hour again slow.
Any ideas?
(dotPeek to the rescue. I couldn't find the class in the source on Codeplex, it may have been removed in v6.)
There's an internal class System.Data.Common.QueryCache.QueryCacheManager in EntityFramework.dll v5.0.0.0, which does what it says, but is a bit complex.
Here's what I'm pretty sure about: There is a timer which is started (if not already running) when a plan is added to the cache. The timer triggers a sweep of the cache every 60000 milliseconds (1 minute), and the cache is then actually swept if there are more than 800 plans cached. Plans which have not been re-used since the last sweep are evicted from the cache. If the cache has fewer than 800 plans in it, the sweep is skipped and the timer is stopped.
Here's what I'm not so sure about: There's part of the cache sweep I don't quite understand, but I assume it's clever. It looks like the algorithm makes it harder for a plan to stay in the cache the more sweeps it lives through, by bitwise shifting its hit count rightward by increasing amounts each sweep. On the first and second sweep it gets shifted 1, then 2, then 4, up to 16. I'm not sure what the reason for this is, and I'm having a hard time figuring out exactly how many times a plan needs to be used for it to stay in the cache more than 5 minutes. I'd appreciate it if anyone could give more information about 1) exactly what it's doing, and 2) what the rationale might be for doing this.
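To make the description above concrete, here is a rough Python model of the sweep. It is a sketch of what this answer describes, not the EF source: the names `CachedPlan`, `age`, `sweep` and `HIGH_MARK` are made up for illustration, and the rule that a plan is evicted once its hit count has decayed to zero is an assumption about how the shifting and the eviction fit together.

```python
# Shift amounts applied on successive sweeps, as described above: 1, 1, 2, 4, 8, 16.
AGING_SHIFTS = [1, 1, 2, 4, 8, 16]
HIGH_MARK = 800  # a sweep only does work when more plans than this are cached


class CachedPlan:
    def __init__(self, hit_count=0):
        self.hit_count = hit_count  # would be incremented on every cache hit
        self.sweeps_survived = 0    # selects the shift used by the next sweep


def age(plan):
    """Decay a plan's hit count by the shift for its current age."""
    index = min(plan.sweeps_survived, len(AGING_SHIFTS) - 1)
    plan.hit_count >>= AGING_SHIFTS[index]
    plan.sweeps_survived += 1


def sweep(cache):
    """Run one sweep over a dict of key -> CachedPlan. Returns False if skipped."""
    if len(cache) <= HIGH_MARK:
        return False  # in the real class the timer would be stopped here
    for key in list(cache):
        plan = cache[key]
        if plan.hit_count == 0:
            del cache[key]  # assumed eviction rule: hit count has reached zero
        else:
            age(plan)
    return True


def sweeps_until_evicted(initial_hits):
    """Sweeps an unused plan lives through before its hit count reaches zero."""
    plan = CachedPlan(initial_hits)
    while plan.hit_count > 0:
        age(plan)
    return plan.sweeps_survived
```

Under these assumptions, the growing shifts mean that even a plan with 1000 hits decays to zero within five sweeps once it stops being used, so a large hit count buys only a few extra minutes.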
Anyway, that's why your plan isn't being cached forever.
Recently our cluster has seen extreme performance degradation. We had 3 nodes, 64 GB, 4CPU (2 core) each for an index that is 250M records, 60GB large. Performance was acceptable for months.
Since then we've:
1. Added a fourth server, same configuration.
2. Split the index into two indexes, query them with an alias
3. Disabled paging (Windows Server 2012)
4. Added synonym analysis on one field
Our cluster can now survive for a few hours before it's basically useless. I have to restart elastic on each node to rectify the problem. We tried bumping each node to 8 cpus (2 cores) with little to no gain.
One issue is that EVERY QUERY uses up 100% of the CPU of whatever node it hits. Every query is faceted on 3+ fields, which hasn't changed since our cluster was healthy. Unfortunately I'm not sure if this was happening before, but it certainly seems like an issue. We obviously need to be able to respond to more than one request every few seconds. When multiple requests come in at the same time, the performance doesn't seem to get worse for those particular responses. Again, over time, the performance slows to a crawl; the CPU (all cores) stays maxed out indefinitely.
I'm using elasticsearch 1.3.4 and the plugin elasticsearch-analysis-phonetic 2.3.0 on every box and have been even when our performance wasn't so terrible.
Any ideas?
UPDATE:
It seems like the performance issue is due to index aliasing. When I pointed the site to a single index that ultimately stores about 80% of the data, the CPU wasn't being throttled. There were still a few 100% spikes, but they were much shorter. When I pointed it back to the alias (which points to two indexes total), I could literally bring the cluster down by refreshing the page a dozen times quickly: CPU usage goes to 100% on every query and gets stuck there with many in a row.
Is there a known issue with elastic search aliases? Am I using the alias incorrectly?
UPDATE 2:
Found the cause in the logs. Paging queries are TERRIBLE. Is this a known bug in elastic? If I run an empty query then try and view the last page (from 100,000,000 e.g.) it brings the whole cluster down. That SINGLE QUERY. It gets through the first 1.5M results then quits, all the while taking up 100% of the CPU for over a minute.
UPDATE 3:
So here's something else strange. When I point to an old index on dev (same size, no aliases) and try to reproduce the paging issue, the cluster doesn't get hit immediately. It has 1% CPU usage for the first 20 seconds after the query. The query returns with an error before the CPU usage ever goes up. About 2 minutes later, CPU usage spikes to 100% and the server basically crashes (it can't do anything else because the CPU is so overtaxed). On the production index this CPU load is instantaneous (it happens immediately after a query is made).
Without checking certain metrics it is very difficult to identify the cause of a slow response or any other issue. But from the data you have mentioned, it looks like there are too many cache evictions happening, thereby increasing the number of garbage collections on your nodes. Frequent garbage collection (mainly the old GC) will consume a lot of CPU. This in turn will start to affect the whole cluster.
As you have mentioned, it started giving issues only after you added another node. This surprises me. Is there any increase in the traffic?
Can you include the output of the _stats API taken at the time when your cluster slows down? It will have a lot of information from which I can make a better diagnosis. Also include a sample of the query.
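As a rough illustration of what to look for in that output, the sketch below pulls the eviction counters out of a parsed _stats response. The field layout (`_all.total.filter_cache.evictions` and `_all.total.fielddata.evictions`) is an assumption about what a 1.x cluster returns and should be checked against your actual output; `eviction_counts` and the sample numbers are made up for illustration.

```python
import json


def eviction_counts(stats):
    """Return cache eviction counters from a parsed _stats response (a dict)."""
    total = stats.get("_all", {}).get("total", {})
    return {
        name: total.get(name, {}).get("evictions")
        for name in ("filter_cache", "fielddata")
    }


# Hypothetical, heavily trimmed response body; real output has many more fields.
sample = json.loads("""
{"_all": {"total": {
    "filter_cache": {"memory_size_in_bytes": 1048576, "evictions": 120},
    "fielddata": {"memory_size_in_bytes": 2097152, "evictions": 45}
}}}
""")

print(eviction_counts(sample))
```

Taking two snapshots a few minutes apart and comparing the counters shows whether evictions are still climbing while the cluster is slow.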
I suggest you install bigdesk so that you can get a graphical view of your cluster health more easily.
I ran a search and the first time it took 3-4 seconds.
I ran the same search a second time and it took less than 100 ms (as expected, since it used the cache).
Then I cleared the cache by calling "http://host:port/index/_cache/clear".
Next I ran the same search and was expecting it to take 3-4 seconds, but it took less than 100 ms.
So the clearing of the cache didn't work?
What exactly got cleared by that url?
How do I make ES do the raw search (i.e. no caching) every time?
I am doing this as part of some load testing.
Clearing the cache will empty:
Field data (used by facets, sorting, geo, etc)
Filter cache
parent/child cache
Bloom filters for posting lists
The effect you are seeing is probably due to the OS file system cache. Elasticsearch and Lucene leverage the OS file system cache heavily due to the immutable nature of Lucene segments. This means that small indices tend to be cached entirely in memory by your OS and become diskless.
As an aside, it doesn't really make sense to benchmark Elasticsearch in a "cacheless" state. It is designed and built to operate in a cached environment: much of the performance that Elasticsearch is known for is due to its excellent use of caching.
To be completely accurate, your benchmark should really be looking at a system that has fully warmed the JVM (to properly size the new-eden space, optimize JIT output, etc) and using real, production-like data to simulate "real world" cache filling and eviction on both the ES and OS levels.
Synthetic tests such as "no-cache environment" make little sense.
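A warmed-up benchmark along those lines might be structured like the sketch below. It is only an outline: `run_query` stands for whatever function issues one production-like search against your cluster, and the run counts are arbitrary placeholders.

```python
import time


def benchmark(run_query, warmup_runs=50, measured_runs=200):
    """Call run_query repeatedly; return latencies (seconds) of the measured runs only."""
    for _ in range(warmup_runs):
        run_query()  # timings discarded: lets the JVM, ES caches and OS cache settle
    latencies = []
    for _ in range(measured_runs):
        start = time.perf_counter()
        run_query()
        latencies.append(time.perf_counter() - start)
    return latencies


def percentile(sorted_values, fraction):
    """Simple index-based percentile on an already sorted, non-empty list."""
    index = min(len(sorted_values) - 1, int(fraction * len(sorted_values)))
    return sorted_values[index]
```

Reporting something like `percentile(sorted(latencies), 0.95)` rather than the first-run time gives a number that reflects the cached state the system is meant to run in.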
I don't know if this is what you're experiencing, but the cache isn't cleared immediately when you call clear cache. It is scheduled to be deleted in the next 60 seconds.
source: https://www.elastic.co/guide/en/elasticsearch/reference/1.5/indices-clearcache.html
I have a CouchDB application, and for most of the views I notice that the time taken by the server to return a response varies from 10 ms to 100 ms. I do not have any concurrent write operations on the server, and there are at most 10 concurrent read requests.
How should I diagnose the problem? Where should I look?
I am running it on a rackspace cloud machine with 1GB RAM.
From the CouchDB Guide:
If you read carefully over the last few paragraphs, one part stands out: “When you query your view, CouchDB takes the source code and runs it for you on every document in the database.” If you have a lot of documents, that takes quite a bit of time and you might wonder if it is not horribly inefficient to do this. Yes, it would be, but CouchDB is designed to avoid any extra costs: it only runs through all documents once, when you first query your view. If a document is changed, the map function is only run once, to recompute the keys and values for that single document.
Most likely you are seeing the views being regenerated and recached.
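The behaviour described in the quote can be modelled with a toy index like the one below. This is not how CouchDB is implemented (the real thing works from the database's change sequence and stores the result in a B-tree); `IncrementalView` is a made-up class that only shows why the first query is slow and later ones are cheap.

```python
class IncrementalView:
    """Toy model of a view index that re-runs map only for changed documents."""

    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.indexed_rev = {}  # doc id -> revision the index was built from
        self.rows = {}         # doc id -> list of (key, value) pairs
        self.map_calls = 0     # how often the map function actually ran

    def query(self, docs):
        """docs: dict of doc id -> document dict containing a "_rev" field."""
        for doc_id, doc in docs.items():
            if self.indexed_rev.get(doc_id) != doc["_rev"]:
                self.map_calls += 1
                self.rows[doc_id] = list(self.map_fn(doc))
                self.indexed_rev[doc_id] = doc["_rev"]
        for doc_id in set(self.rows) - set(docs):  # drop rows of deleted docs
            del self.rows[doc_id]
            del self.indexed_rev[doc_id]
        return sorted(pair for pairs in self.rows.values() for pair in pairs)
```

The first `query` runs the map function once per document; repeating it with unchanged documents runs it zero times, and changing one document costs exactly one more call.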
We have a web service which provides search over hotels. There is a problem with performance: a single request to the service takes around 5000 ms. Almost all of the time is spent in the database executing stored procedures. During the request our server (MSSQL 2008) consumes ~90% of the processor time. When 2 requests are made in parallel, the average time grows to around 7000 ms. As the number of requests increases, the average response time increases as well. We have 20-30 requests per minute.
Which kind of optimization is best in this case, bearing in mind that the goal is to provide a stable response time for the service:
1) Try to decrease the stored procedures' execution time
2) Try to find a way to take load off the server
It is interesting to hear from people who deal with booking sites.
"It is interesting to hear from people who deal with booking sites. Thanks!"
This has nothing to do with booking sites. You have poorly written stored procedures, possibly no indexes, and your queries are probably not SARGable, so they have to scan the table every time. Are your statistics up to date?
Run some procs from SSMS and look at the execution plans.
It is also a good idea to run Profiler. How about your page life expectancy and buffer cache hit ratio? Take a look at "Use sys.dm_os_performance_counters to get your Buffer cache hit ratio and Page life expectancy counters" to get those numbers.
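One thing that trips people up with that DMV is that the hit ratio counter is not a percentage on its own: it has to be divided by the matching "Buffer cache hit ratio base" counter. A small sketch of the arithmetic, assuming the counter values have already been read into a dict keyed by counter name with any padding stripped (`buffer_health` and the sample values are made up):

```python
def buffer_health(counters):
    """counters: dict of counter name -> cntr_value read from the DMV."""
    ratio = counters["Buffer cache hit ratio"]
    base = counters["Buffer cache hit ratio base"]
    return {
        # The raw ratio only becomes a percentage once divided by its base.
        "buffer_cache_hit_ratio_pct": 100.0 * ratio / base if base else None,
        # Page life expectancy is reported directly, in seconds.
        "page_life_expectancy_s": counters["Page life expectancy"],
    }


# Hypothetical values for illustration only.
print(buffer_health({
    "Buffer cache hit ratio": 990,
    "Buffer cache hit ratio base": 1000,
    "Page life expectancy": 1200,
}))
```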
I think the first thing you have to do is to quantify what's going on on the server.
Use SQL Server Profiler to get an accurate picture of the activity on the server.
Identify which procedures / SQL statements take up the most resources
Identify high priority SQL operations consuming a lot of resources / taking time
Prioritize
Fix
Now, when I say "Fix", I mean that you should execute the procedure / statement manually in SSMS - Make sure you have "Show Execution Plan" turned ON.
Review the execution plan for parts that consume the most resources and then figure out how to correct that. You may need to create a new index, rewrite the SQL to be more efficient by using hints, etc.
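For the "identify" and "prioritize" steps, it helps to aggregate the trace by procedure rather than looking at individual calls. A minimal sketch, assuming the trace has been exported as rows of (name, duration in ms, CPU in ms); the function name and the procedure names in the usage example are hypothetical:

```python
from collections import defaultdict


def rank_by_total_duration(trace_rows):
    """trace_rows: iterable of (statement_name, duration_ms, cpu_ms) tuples."""
    totals = defaultdict(lambda: {"calls": 0, "duration_ms": 0, "cpu_ms": 0})
    for name, duration_ms, cpu_ms in trace_rows:
        entry = totals[name]
        entry["calls"] += 1
        entry["duration_ms"] += duration_ms
        entry["cpu_ms"] += cpu_ms
    # Highest total duration first: these are the candidates to fix first.
    return sorted(totals.items(), key=lambda item: item[1]["duration_ms"], reverse=True)


rows = [
    ("usp_SearchHotels", 4200, 3900),
    ("usp_GetRates", 300, 250),
    ("usp_GetRates", 350, 260),
    ("usp_SearchHotels", 4500, 4100),
]
for name, stats in rank_by_total_duration(rows):
    print(name, stats)
```

Ranking by total rather than per-call cost matters because a cheap procedure called thousands of times can outweigh one slow call.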
You provide no detail to solve your problem. In general to increase performance of a stored procedure I look at:
1) replace any cursors or loops with set-based operations
2) make sure all queries are using an index and using an efficient execution plan (check this with SET SHOWPLAN_ALL ON)
3) make sure there is no locking or blocking slowing it down (see the query given here)
Without more info on the specifics, it is hard to make any suggestions.
"Almost all of the time is spent in database by executing storing procedures."
How many procedures is the app calling? What do they do? Are transactions involved? Are the procedures recompiling on each call? Do you have any indexes? Are statistics up to date? Etc., etc. You need to give a lot more info, or any help here is a complete guess.
I use Visual Studio Team System 2008 Team Suite for load testing my web application (it uses ASP.NET MVC).
Load pattern: Constant (this means I have a constant number of virtual users all the time).
I specify a configuration of 1000 users to analyze the performance of my web application under real stress conditions. I run the same load test multiple times while making some changes in my application.
But while analyzing the load test results I came across a strange dependency: when the average page response time becomes larger, the requests per second value increases too! And vice versa: when the average page response time is lower, the requests per second value is lower. This situation does not reproduce when the number of users is small (5-50 users).
How can you explain such results?
Perhaps there is a misunderstanding of the term Requests/Sec here. Requests/Sec, as I understand it, is just a representation of how many requests the test is pushing into the application (not the number of requests completed per second).
If you look at it that way, this might make sense.
High Requests/Sec will cause higher Avg. Response Time (due to bottleneck somewhere, i.e. CPU bound, memory bound or IO bound).
So as your Requests/Sec goes up and you have tons of objects in memory, the memory is under pressure, thus triggering garbage collection that will slow down your Response Time.
Or as your Requests/Sec goes up, and your CPU got hammered, you might have to wait for CPU time, thus making your Response Time higher.
Or as your Requests/Sec goes up, your SQL is not tuned properly, and blocking and deadlocking occur, thus making your Response Time higher.
These are just examples of why you might see this correlation. You might have to track it down some more in terms of CPU, memory usage and IO (network, disk, SQL, etc.).
A few more details about the problem: we are load testing our rendering engine [NDjango][1] against the standard ASP.NET aspx. The web app we are using to load test is very basic - it consists of 2 static pages - no database, no heavy processing, just rendering. What we see is that in terms of avg response time aspx as expected is considerably faster, but to my surprise the number of requests per second as well as total number of requests for the duration of the test is much lower.
Leaving aside what we are testing against what, I agree with Jimmy, that higher request rate can clog the server in many ways. But it is my understanding that this would cause the response time to go up - right?
If the numbers we are getting really reflect what's happening on the server, I do not see how this rule can be broken. So for now the only explanation I have is that the numbers are skewed - something is wrong with the way we are configuring the tool.
[1]: http://www.ndjango.org NDjango
This is a normal result: as the number of users increases, you load the server with a higher number of requests per second. Any server will take longer to deal with more requests per second, meaning the average page response time increases.
Requests per second is a measure of the load being applied to the application, and average page response time is a measure of the application's performance, where a high number means a slow response.
You will be better off using a stepped number of users or a warmup period where the load is applied gradually to the server.
Also, with 1000 virtual users on a single test machine, the CPU of the test machine will be absolutely maxed out. That will most likely be the thing that is skewing the results of your testing. Playing with the number of virtual users, you will find that there is a point where the requests per second are maxed out. Adding or taking away virtual users from that point will result in fewer requests per second from the app.
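One way to sanity-check the numbers is the idealized closed-loop relation (Little's law): if every virtual user waits for its response before sending the next request, then requests per second is roughly the number of users divided by response time plus think time. The sketch below is just that formula; the function name and the figures are made up, and a real load agent under CPU pressure will not follow it exactly.

```python
def closed_loop_throughput(users, response_time_s, think_time_s=0.0):
    """Idealized requests/sec for a fixed number of users in a closed loop."""
    cycle = response_time_s + think_time_s
    if cycle <= 0:
        raise ValueError("response time plus think time must be positive")
    return users / cycle


# With a constant 1000 users, doubling the response time halves the request rate.
print(closed_loop_throughput(1000, 2.0))  # 500.0
print(closed_loop_throughput(1000, 4.0))  # 250.0
```

If the measured requests per second and response time rise together at a constant user count, the test is not behaving like this model, which points back at the test machine or the tool configuration rather than the server.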