I'm querying my database with ES 2.3.1 and I've been measuring the response times, but I got three very different results.
First, I measured the very first query against the database. It takes about 9 seconds.
For the second measurement, I shut down ES, cleared the RAM and caches, and queried again. It takes about 1.2 seconds.
The third time I queried without clearing the caches, and it takes 97 ms.
Can anyone explain why this happens?
I understand that the last measurement is the fastest because the previously queried data is already in the cache, and I assume the first query takes longer because the data has to be pulled into the cache.
But to my mind, after clearing the cache and RAM, the second measurement should have taken as long as the first one, and it didn't. Can someone explain why?
Related
Recently our cluster has seen extreme performance degradation. We had 3 nodes, each with 64 GB of RAM and 4 CPUs (2 cores), for an index of 250M records, about 60 GB in size. Performance was acceptable for months.
Since then we've:
1. Added a fourth server, same configuration.
2. Split the index into two indexes, queried through an alias
3. Disabled paging (Windows Server 2012)
4. Added synonym analysis on one field
Our cluster can now survive only a few hours before it's basically useless. I have to restart Elasticsearch on each node to rectify the problem. We tried bumping each node to 8 CPUs (2 cores) with little to no gain.
One issue is that EVERY QUERY uses up 100% of the CPU of whatever node it hits. Every query is faceted on 3+ fields, which hasn't changed since our cluster was healthy. Unfortunately I'm not sure whether this was happening before, but it certainly seems like an issue. Obviously we need to be able to respond to more than one request every few seconds. When multiple requests come in at the same time, the performance doesn't seem to get worse for those particular responses. But over time, the performance slows to a crawl; the CPU (all cores) stays maxed out indefinitely.
I'm using Elasticsearch 1.3.4 and the elasticsearch-analysis-phonetic 2.3.0 plugin on every box, and have been even when our performance wasn't so terrible.
Any ideas?
UPDATE:
It seems like the performance issue is due to index aliasing. When I pointed the site to a single index that ultimately stores about 80% of the data, the CPU wasn't getting pegged. There were still a few 100% spikes, but they were much shorter. When I pointed it back to the alias (which points to two indexes in total), I could literally bring the cluster down by refreshing the page a dozen times quickly: CPU usage goes to 100% on every query and gets stuck there with many in a row.
Is there a known issue with Elasticsearch aliases? Am I using the alias incorrectly?
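In case it matters, an alias over two indexes is normally set up with a single _aliases call like the sketch below (the index and alias names are placeholders, not our real ones); searches against the alias then fan out to both indexes.

    // Sketch only: how an alias spanning two indexes gets created.
    // "records_a", "records_b" and "records" are placeholder names.
    using System;
    using System.Net.Http;
    using System.Text;

    var es = new HttpClient { BaseAddress = new Uri("http://localhost:9200") };

    // One alias ("records") pointing at both physical indexes; searches against
    // /records/_search are executed on records_a and records_b together.
    var body = @"{
      ""actions"": [
        { ""add"": { ""index"": ""records_a"", ""alias"": ""records"" } },
        { ""add"": { ""index"": ""records_b"", ""alias"": ""records"" } }
      ]
    }";

    var response = await es.PostAsync("/_aliases",
        new StringContent(body, Encoding.UTF8, "application/json"));
    Console.WriteLine(await response.Content.ReadAsStringAsync());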
UPDATE 2:
Found the cause in the logs. Paging queries are TERRIBLE. Is this a known bug in Elasticsearch? If I run an empty query and then try to view the last page (from 100,000,000, for example), it brings the whole cluster down. That SINGLE QUERY. It gets through the first 1.5M results and then quits, all the while taking up 100% of the CPU for over a minute.
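The request shape that triggers it is roughly the sketch below: a plain match_all query paged very deep with from/size (the index name is a placeholder).

    // Sketch of the kind of request that brings the cluster down: an "empty"
    // (match_all) query paged extremely deep. "records" is a placeholder
    // index/alias name.
    using System;
    using System.Net.Http;
    using System.Text;

    var es = new HttpClient { BaseAddress = new Uri("http://localhost:9200") };

    var body = @"{
      ""query"": { ""match_all"": {} },
      ""from"": 100000000,
      ""size"": 10
    }";

    var response = await es.PostAsync("/records/_search",
        new StringContent(body, Encoding.UTF8, "application/json"));
    Console.WriteLine(await response.Content.ReadAsStringAsync());

Every shard has to collect and sort from + size hits for a request like this, which is why a huge from value is so expensive; on 1.x the scan/scroll API is the usual way to walk deep into a result set instead of ever-larger from values.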
UPDATE 3:
So here's something else strange. Pointing to an old index on dev (same size, no aliases) and trying to reproduce the paging issue, the cluster doesn't get hit immediately. It sits at 1% CPU usage for the first 20 seconds after the query. The query returns with an error before the CPU usage ever goes up. About 2 minutes later, CPU usage spikes to 100% and the server basically crashes (it can't do anything else because the CPU is so overtaxed). On the production index this CPU load is instantaneous (it happens immediately after a query is made).
Without checking certain metrics it is very difficult to identify the cause of the slow responses or any other issue. But from the data you have mentioned, it looks like there are too many cache evictions happening, thereby increasing the amount of garbage collection on your nodes. Frequent garbage collection (mainly old-generation GC) consumes a lot of CPU, which in turn starts to affect the whole cluster.
As you have mentioned, it started giving issues only after you added another node. This surprises me. Is there any increase in traffic?
Can you include the output of the _stats API, taken at the time your cluster slows down? It contains a lot of information from which a better diagnosis can be made. Also include a sample query.
I suggest installing bigdesk so that you can get a graphical view of your cluster's health more easily.
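If it's easier, something along these lines can capture both snapshots while the cluster is slow (it assumes a node reachable on localhost:9200; adjust as needed):

    // Minimal sketch: dump index-level and node-level stats to files while the
    // cluster is misbehaving, so GC times, heap usage and cache/fielddata
    // evictions can be compared against a healthy period later.
    using System;
    using System.IO;
    using System.Net.Http;

    var es = new HttpClient { BaseAddress = new Uri("http://localhost:9200") };

    // Index-level stats: search/indexing totals, filter cache, fielddata, merges.
    File.WriteAllText("index_stats.json", await es.GetStringAsync("/_stats"));

    // Node-level stats: JVM heap, GC counts and times, thread pools, evictions.
    File.WriteAllText("node_stats.json", await es.GetStringAsync("/_nodes/stats"));

    Console.WriteLine("Wrote index_stats.json and node_stats.json");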
I'm experiencing performance issues with some of my queries against an EF5 DbContext model in my ASP.NET MVC application.
The queries have many includes over multiple levels of navigation graph, e.g.:
    Context.Cars
        .Include(c => c.Model.Maker)
        .Include(c => c.CarOwners.Select(co => co.Owner))
        .Include(c => c.Navigation1)
        .Include(c => c.Navigation2)
        .Include(c => c.Navigation3)
        .ToList();
The first time I run a query it takes about 10 seconds to execute, but when I refresh the page the second time it takes less than a second.
I have run Visual Studio's Performance Analysis tool to see where the problem is, and it seems that the GetExecutionPlan() method is consuming most of the time.
I guess the plan is being cached, since the second time the query is run (on a page refresh) it executes really fast (less than a second).
I understand that the performance of the first page load is limited, since the query is really complicated (the SQL dumped to the DB is about 4k lines long). But the problem is that if I return to the page an hour or so later, the query is slow again. It seems like the execution plan cache is being cleared somehow. I've checked the IIS settings and all application pool recycling settings are turned off.
Just to be clear, I'm not looking for ways to optimize my queries; I'm wondering why my query behaves strangely: the first load is slow, the second load is fast, and a load an hour later is slow again.
Any ideas?
(dotPeek to the rescue. I couldn't find the class in the source on CodePlex; it may have been removed in v6.)
There's an internal class System.Data.Common.QueryCache.QueryCacheManager in EntityFramework.dll v5.0.0.0 which does what it says, but it is a bit complex.
Here's what I'm pretty sure about: there is a timer which is started (if not already running) when a plan is added to the cache. The timer triggers a sweep of the cache every 60,000 milliseconds (1 minute), and the cache is then actually swept if there are more than 800 plans cached. Plans which have not been re-used since the last sweep are evicted from the cache. If the cache has fewer than 800 plans in it, the sweep is skipped and the timer is stopped.
Here's what I'm not so sure about: there's a part of the cache sweep I don't quite understand, but I assume it's clever. It looks like the algorithm makes it harder for a plan to stay in the cache the more sweeps it lives through, by bitwise shifting its hit count rightward by increasing amounts on each sweep. On the first and second sweeps it gets shifted by 1, then 2, then 4, up to 16. I'm not sure what the reason for this is, and I'm having a hard time figuring out exactly how many times a plan needs to be used for it to stay in the cache for more than 5 minutes. I'd appreciate it if anyone could give more information about 1) exactly what it's doing, and 2) what the rationale might be for doing this.
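To make that concrete, here is a rough model of the sweep as I understand it. This is not the decompiled code, just an illustration; the exact shift amounts and the eviction rule are my reading of it and may be off.

    // Illustrative model of the QueryCacheManager sweep -- NOT the real EF code.
    using System;
    using System.Collections.Generic;
    using System.Linq;

    int sweepThreshold = 800;                      // sweep only if more than 800 plans are cached
    int[] agingFactors = { 1, 1, 2, 4, 8, 16 };    // assumed right-shift per sweep survived

    var cache = new Dictionary<string, (int HitCount, int SweepsSurvived)>();

    void Sweep()
    {
        if (cache.Count <= sweepThreshold)
            return;                                // small cache: skip the sweep, timer stops

        foreach (var key in cache.Keys.ToList())
        {
            var (hits, sweeps) = cache[key];

            // Decay the hit count; the decay gets harsher the longer the plan has lived.
            int shift = agingFactors[Math.Min(sweeps, agingFactors.Length - 1)];
            hits >>= shift;

            // A plan whose decayed hit count reaches zero (e.g. one not reused
            // since the last sweep) is evicted.
            if (hits == 0)
                cache.Remove(key);
            else
                cache[key] = (hits, sweeps + 1);
        }
    }

    // Toy usage: a rarely used plan falls out on the first sweep, a hot plan survives.
    cache["rarely-used-plan"] = (1, 0);
    cache["hot-plan"] = (5000, 0);
    for (int i = cache.Count; i <= sweepThreshold; i++)
        cache[$"filler-{i}"] = (1, 0);             // pad past the 800-plan threshold

    Sweep();
    Console.WriteLine(string.Join(", ", cache.Keys.Where(k => !k.StartsWith("filler"))));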
Anyway, that's why your plan isn't being cached forever.
I have a CouchDB application, and for most of the views I notice that the time taken by the server to return a response varies from 10 ms to 100 ms. I do not have any concurrent write operations on the server, and there are at most 10 concurrent read requests.
How should I diagnose the problem? Where should I look?
I am running it on a Rackspace Cloud machine with 1 GB of RAM.
From the CouchDB Guide:
If you read carefully over the last few paragraphs, one part stands out: “When you query your view, CouchDB takes the source code and runs it for you on every document in the database.” If you have a lot of documents, that takes quite a bit of time and you might wonder if it is not horribly inefficient to do this. Yes, it would be, but CouchDB is designed to avoid any extra costs: it only runs through all documents once, when you first query your view. If a document is changed, the map function is only run once, to recompute the keys and values for that single document.
Most likely you are seeing the views being regenerated and re-cached.
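One quick way to check is to time a normal view request against one made with stale=ok, which tells CouchDB to serve the existing index without updating it first. If only the normal request is ever slow, the variance is coming from index regeneration. The database, design document and view names below are placeholders.

    // Timing sketch: query the same view with and without stale=ok.
    // "mydb", "app" and "by_date" are placeholder names.
    using System;
    using System.Diagnostics;
    using System.Net.Http;
    using System.Threading.Tasks;

    var couch = new HttpClient { BaseAddress = new Uri("http://localhost:5984") };

    async Task TimeRequest(string path)
    {
        var sw = Stopwatch.StartNew();
        var body = await couch.GetStringAsync(path);
        sw.Stop();
        Console.WriteLine($"{path} -> {sw.ElapsedMilliseconds} ms ({body.Length} bytes)");
    }

    await TimeRequest("/mydb/_design/app/_view/by_date?limit=10");          // may wait for reindexing
    await TimeRequest("/mydb/_design/app/_view/by_date?limit=10&stale=ok"); // serves the current index as-is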
I have a database table with N records, each of which needs to be refreshed every 4 hours. The "refresh" operation is pretty resource-intensive. I'd like to write a scheduled task that runs occasionally and refreshes them, while smoothing out the spikes of load.
The simplest task I started with is this (pseudocode):
    every 10 minutes:
        find all records that haven't been refreshed in 4 hours
        for each record:
            refresh it
            set its last refresh time to now
(Technical detail: "refresh it" above is asynchronous; it just queues a task for a worker thread pool to pick up and execute.)
What this causes is a huge resource (CPU/IO) usage spike every 4 hours, with the machine idling the rest of the time. Since the machine also does other stuff, this is bad.
I'm trying to figure out a way to get these refreshes more or less evenly spaced out -- that is, I'd want around N × (10 minutes / 4 hours), i.e. N/24, of those records to be refreshed on every run. Of course, it doesn't need to be exact.
Notes:
I'm fine with the algorithm taking time to start working (so say, for the first 24 hours there will be spikes but those will smooth out over time), as I only rarely expect to take the scheduler offline.
Records are constantly being added and removed by other threads, so we can't assume anything about the value of N between iterations.
I'm fine with records being refreshed every 4 hours +/- 20 minutes.
Do a full refresh, to get all your timestamps in sync. From that point on, every 10 minutes, refresh the oldest N/24 records.
The load will be steady from the start, and after 24 runs (4 hours), all your records will be updating at 4-hour intervals (if N is fixed). Insertions will decrease refresh intervals; deletions may cause increases or decreases, depending on the deleted record's timestamp. But I suspect you'd need to be deleting quite a lot (like, 10% of your table at a time) before you start pushing anything outside your 40-minute window. To be on the safe side, you could do a few more than N/24 each run.
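In other words, each run is just "order by last refresh time, take the oldest N/24". A minimal sketch of one run is below, with in-memory stand-in data; against a real table this would be an ORDER BY last_refresh LIMIT query plus your existing queueing.

    // Sketch of a single scheduler run: refresh the oldest ceil(N/24) records.
    // The in-memory list stands in for the real table; "refresh" is just a print.
    using System;
    using System.Linq;

    var now = DateTime.UtcNow;

    // Fake data: 100 records with staggered last-refresh times.
    var records = Enumerable.Range(1, 100)
        .Select(i => (Id: i, LastRefresh: now.AddHours(-4).AddMinutes(i)))
        .ToList();

    // 4 hours / 10 minutes = 24 runs per cycle, so each run handles ~N/24 records.
    int batchSize = (int)Math.Ceiling(records.Count / 24.0);

    // Always take the records that have waited longest, whether or not they are
    // strictly "due" yet; this is what keeps the load flat across runs.
    foreach (var r in records.OrderBy(x => x.LastRefresh).Take(batchSize))
        Console.WriteLine($"refresh record {r.Id} (last refreshed {r.LastRefresh:u})");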
    each minute:
        take all records older than 4:10 and refresh them
        if the previous step did not find a lot of records:
            take some of the oldest records older than 3:40 and refresh them
This should eventually make the last-update times more evenly spaced out. What "a lot" and "some" mean is up to you to decide (possibly based on N).
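A small sketch of one run of that rule follows; the thresholds standing in for "a lot" and "some" are arbitrary placeholders to tune for your N.

    // One run of the two-tier rule. Thresholds are placeholders.
    using System;
    using System.Linq;

    int enoughPerRun = 50;   // "a lot"
    int topUpCount = 20;     // "some"

    var now = DateTime.UtcNow;
    var records = Enumerable.Range(1, 260)
        .Select(i => (Id: i, LastRefresh: now.AddMinutes(-i)))   // fake ages: 1..260 minutes
        .ToList();

    // 1. Everything older than 4:10 must be refreshed now.
    var toRefresh = records
        .Where(r => now - r.LastRefresh > TimeSpan.FromHours(4) + TimeSpan.FromMinutes(10))
        .ToList();

    // 2. If that was not "a lot", top up with the oldest records older than 3:40,
    //    which gradually spreads the last-update times out across runs.
    if (toRefresh.Count < enoughPerRun)
        toRefresh.AddRange(records
            .Where(r => now - r.LastRefresh > TimeSpan.FromHours(3) + TimeSpan.FromMinutes(40))
            .Except(toRefresh)
            .OrderBy(r => r.LastRefresh)
            .Take(topUpCount));

    Console.WriteLine($"refreshing {toRefresh.Count} records this minute");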
Give each record its own refresh interval: a random duration between 3:40 and 4:20.
I am trying to spread out data that is received in bursts. That is, some other application receives data in large bursts, and for each data entry I need to make some additional requests to another server, where I should limit the traffic. Hence I try to spread out the requests over the time I have until the next data burst arrives.
Currently I am using a token bucket to spread out the data. However, because the data I receive is already badly shaped, I am still either filling up the queue of pending requests or getting spikes whenever a burst comes in. So this algorithm does not seem to do the kind of shaping I need.
What other algorithms are available to limit the requests? I know I have times of high load and times of low load, so both should be handled well by the application.
I am not sure whether I was really able to explain the problem I am currently having. If you need any clarifications, just let me know.
EDIT:
I'll try to clarify the problem some more and explain why a simple rate limiter does not work.
The problem lies in the bursty nature of the traffic and the fact that bursts have different sizes at different times. What is mostly constant is the delay between bursts. So we get a bunch of data records for processing and need to spread them out as evenly as possible before the next bunch comes in. However, we are not 100% sure when the next bunch will arrive, just approximately, so simply dividing the time by the number of records does not work as it should.
A rate limiter does not work because it does not spread the data out sufficiently. If we are close to saturating the rate, everything is fine and we spread out evenly (although this should not happen too frequently). If we are below the threshold, though, the spreading gets much worse.
I'll give an example to make the problem clearer:
Let's say we limit our traffic to 10 requests per second and new data comes in about every 10 seconds.
When we get 100 records at the beginning of a time frame, we will query 10 records each second and have a perfectly even spread. However, if we get only 15 records, we'll have one second where we query 10 records, one second where we query 5 records, and 8 seconds where we query 0 records, so the traffic level is very uneven over time. It would be better to query just 1.5 records each second instead. However, setting that rate would also cause problems, since new data might arrive earlier, so we would not have the full 10 seconds and 1.5 requests per second would not be enough. If we use a token bucket, the problem actually gets even worse, because token buckets allow bursts to get through at the beginning of the time frame.
However, this example oversimplifies, because we cannot actually tell the exact number of pending requests at any given moment, only an upper limit. So we would have to throttle each time based on that number.
This sounds like a problem within the domain of control theory. Specifically, I'm thinking a PID controller might work.
A first crack at the problem might be dividing the number of records by the estimated time until the next batch. This would be like a P controller - proportional only. But then you run the risk of overestimating the time and building up some unsent records. So try adding in an I term - integral - to account for the built-up error.
I'm not sure you even need a derivative term if the variation in batch size is random. So try using a PI loop - you might build up some backlog between bursts, but it will be handled by the I term.
If it's unacceptable to have a backlog, then the solution might be more complicated...
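A bare-bones simulation of the PI idea might look like the following. The gains, the per-second limit and the burst pattern are all invented for illustration (they are not from your numbers) and would need tuning against the real traffic.

    // Toy PI pacing loop: the backlog of pending requests is the error signal,
    // the output is how many requests to send this second. All constants are
    // placeholders for illustration.
    using System;

    double kp = 0.12;          // proportional gain: react to the backlog visible now
    double ki = 0.01;          // integral gain: react to backlog that has persisted
    double maxRate = 10.0;     // hard per-second limit the downstream server can take
    double integral = 0.0;     // accumulated error ("backlog that built up")
    double backlog = 0.0;

    var rng = new Random(1);

    for (int second = 0; second < 40; second++)
    {
        // A burst of varying size arrives roughly every 10 seconds.
        if (second % 10 == 0)
            backlog += rng.Next(15, 120);

        double error = backlog;                        // setpoint: an empty queue
        integral = Math.Min(integral + error, 1000);   // clamp to limit wind-up

        double rate = Math.Min(maxRate, kp * error + ki * integral);
        double sent = Math.Min(backlog, rate);         // one loop iteration = one second

        backlog -= sent;
        if (backlog <= 0) integral = 0;                // queue drained: forget old error

        Console.WriteLine($"t={second,2}s  sent {sent,5:0.00} req/s  backlog {backlog,6:0.0}");
    }

With a small burst the P term alone keeps the rate low, so the records get spread over most of the gap; if a backlog persists (for example because a burst arrived early), the integral term pushes the rate up until it is worked off.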
If there are no other constraints, what you should do is figure out the maximum rate at which you are comfortable sending the additional requests, and limit your processing speed to that. Then monitor what happens. If that gets through all of your requests quickly, then there is no harm done. If its sustained processing rate is not fast enough, then you need more capacity.