Elasticsearch - queries throttle CPU - Windows

Recently our cluster has seen extreme performance degradation. We had 3 nodes (64 GB RAM, 4 CPUs with 2 cores each) for an index of 250M records, about 60 GB in size. Performance was acceptable for months.
Since then we've:
1. Added a fourth server with the same configuration.
2. Split the index into two indexes, queried through an alias.
3. Disabled the page file (Windows Server 2012).
4. Added synonym analysis on one field.
Our cluster can now survive for a few hours before it's basically useless. I have to restart Elasticsearch on each node to rectify the problem. We tried bumping each node to 8 CPUs (2 cores each) with little to no gain.
One issue is that EVERY QUERY uses up 100% of the CPU of whatever node it hits. Every query is faceted on 3+ fields, which hasn't changed since our cluster was healthy. Unfortunately I'm not sure if this was happening before, but it certainly seems like an issue. We obviously need to be able to respond to more than one request every few seconds. When multiple requests come in at the same time, performance doesn't seem to get worse for those particular responses. But over time, performance slows to a crawl; the CPU (all cores) stays maxed out indefinitely.
I'm using Elasticsearch 1.3.4 and the elasticsearch-analysis-phonetic 2.3.0 plugin on every box, and had been even when our performance wasn't so terrible.
Any ideas?
UPDATE:
It seems like the performance issue is due to index aliasing. When I pointed the site to a single index that ultimately stores about 80% of the data, the CPU wasn't being throttled. There were still a few 100% spikes, but they were much shorter. When I pointed it back to the alias (which points to two indexes in total), I could literally bring the cluster down by refreshing the page a dozen times in quick succession: CPU usage goes to 100% on every query and gets stuck there when several arrive in a row.
Is there a known issue with Elasticsearch aliases? Am I using the alias incorrectly?
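For reference, the alias is set up in the standard way via the _aliases API, roughly like this (index and alias names here are stand-ins for our real ones):

curl -XPOST 'http://localhost:9200/_aliases' -d '{
  "actions": [
    { "add": { "index": "records_a", "alias": "records" } },
    { "add": { "index": "records_b", "alias": "records" } }
  ]
}'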
UPDATE 2:
Found the cause in the logs: paging queries are TERRIBLE. Is this a known bug in Elasticsearch? If I run an empty query and then try to view the last page (e.g. from=100,000,000), it brings the whole cluster down. That SINGLE QUERY. It gets through the first 1.5M results and then quits, all the while consuming 100% of the CPU for over a minute.
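To illustrate, a request shaped like the one below is what kills the cluster (host and alias names are placeholders). Deep from/size paging is expensive by design: every shard has to collect and sort from + size hits before almost all of them are thrown away:

curl -XGET 'http://localhost:9200/records/_search' -d '{
  "query": { "match_all": {} },
  "from": 100000000,
  "size": 10
}'

On ES 1.x, walking a full result set is supposed to be done with scan + scroll instead, which only ever handles one batch at a time:

curl -XGET 'http://localhost:9200/records/_search?search_type=scan&scroll=1m&size=100' -d '{
  "query": { "match_all": {} }
}'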
UPDATE 3:
So here's something else strange. Pointing to an old index on dev (same size, no aliases) and trying to reproduce the paging issue, the cluster doesn't get hit immediately. It sits at 1% CPU usage for the first 20 seconds after the query. The query returns with an error before CPU usage ever goes up. About 2 minutes later, CPU usage spikes to 100% and the server basically crashes (it can't do anything else because the CPU is so overtaxed). On the production index this CPU load is instantaneous (it happens immediately after a query is made).

Without checking certain metrics it is very difficult to identify the cause of slow responses or any other issue. But from the data you have mentioned, it looks like there are too many cache evictions happening, thereby increasing the frequency of garbage collection on your nodes. Frequent garbage collection (mainly old-generation GC) will consume a lot of CPU, and this in turn will start to affect the whole cluster.
As you have mentioned, it started giving issues only after you added another node. This surprises me. Is there any increase in traffic?
Can you include the output of the _stats API taken at the time your cluster slows down? It will have a lot of information from which I can make a better diagnosis. Also include a sample of the query.
I suggest installing bigdesk so that you can get a graphical view of your cluster health more easily.
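Something like this, run while the cluster is in the degraded state, would capture what I'm after (host name assumed):

curl 'http://localhost:9200/_stats?pretty' > stats.json    # index-level stats, including cache evictions
curl 'http://localhost:9200/_nodes/hot_threads'            # shows what those maxed-out CPUs are busy doing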

Related

Optimize Elasticsearch / JVM

I have a website for classifieds. For this I'm using Elasticsearch, Postgres, and Rails on the same Ubuntu 14.04 dedicated server, with 256 GB of RAM and 20 cores (40 threads).
I have 10 indexes in Elasticsearch, each with the default number of shards (5). They hold between 1,000 and 400,000 classifieds depending on the index.
There are approximately 5,000 requests per minute, two thirds of which trigger an Elasticsearch request.
According to htop, the JVM is using around 500% CPU.
I have tried different options: I reduced the number of shards per index, and I also tried changing JAVA_OPTS as follows:
#JAVA_OPTS="$JAVA_OPTS -XX:+UseParNewGC"
#JAVA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC"
#JAVA_OPTS="$JAVA_OPTS -XX:CMSInitiatingOccupancyFraction=75"
#JAVA_OPTS="$JAVA_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"
but it doesn't seem to change anything.
So, two questions:
When you change a setting in Elasticsearch and then restart, should the improvement (if any) be visible immediately, or can it show up a bit later thanks to caching or anything else?
Can anyone help me find a good JVM/Elasticsearch configuration so it doesn't consume so many resources?
First, it's a horrible idea to run your web server, database, and Elasticsearch server all on the same box. Each of these should be given its own box, at least. In the case of Elasticsearch, it's actually recommended to have at least 3 servers, or nodes. That way you end up with a load-balanced cluster that won't run into split-brain issues.
Further, sharding only makes sense in a cluster. If you only have one node, then all the shards reside on the same node. This causes two performance problems. First, you take the hit that sharding always adds: for every query, Elasticsearch must query each shard individually (each is a separate Lucene index), then combine and process the results from all the shards to produce the final result. That's a not-insignificant amount of overhead. Second, because all the shards reside on the same node, you're I/O-locked: the shards have to be queried one at a time instead of all at once. Optimally, you should have one shard per node; however, since you can't create more shards without reindexing, it's common to have a few extra hanging around for future horizontal scaling. In that scenario, the cost of reindexing what could be hundreds of gigs of data or more outweighs a small performance bottleneck. However, if you've got 5 shards running on one node, that's probably a large part of your performance problem right there.
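For a single node, a minimal index definition might look like the sketch below (index name made up): one primary shard and no replicas, since there is no second node to host them anyway:

curl -XPUT 'http://localhost:9200/classifieds_index' -d '{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}'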
Finally, and again with Elasticsearch in particular, swapping is a huge no-no. Much of what makes Elasticsearch efficient is its cache, which all resides in RAM. If swapping occurs, it jacks with the cache in sometimes unpredictable ways. As a result, it's recommended to turn off swapping completely on the box your node(s) run on, and to set the Elasticsearch JVM to a min and max memory consumption of roughly half the available RAM of the box. That's virtually impossible to achieve if you have other things running on it, like a web server or database. Databases in particular aggressively consume RAM in order to increase throughput, which is why those should likewise reside on their own servers.
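A rough sketch of both recommendations on a dedicated Linux box running ES 1.x (the heap size is an example assuming a 32 GB machine, not a prescription for your setup):

sudo swapoff -a            # turn swapping off entirely
export ES_HEAP_SIZE=16g    # ~half the box's RAM; the startup script sets -Xms and -Xmx from this

And in elasticsearch.yml:

bootstrap.mlockall: true   # additionally lock the heap in RAM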

Elasticsearch cache clear doesn't seem to do what I expected

I ran a search the first time and it took 3-4 seconds.
I ran the same search a second time and it took less than 100 ms (as expected, since it used the cache).
Then I cleared the cache by calling "http://host:port/index/_cache/clear"
Next I ran the same search, expecting it to take 3-4 seconds, but it took less than 100 ms.
So the clearing of the cache didn't work?
What exactly got cleared by that URL?
How do I make ES do the raw search (i.e. no caching) every time?
I am doing this as part of some load testing.
Clearing the cache will empty:
- Field data (used by facets, sorting, geo, etc.)
- Filter cache
- Parent/child cache
- Bloom filters for posting lists
The effect you are seeing is probably due to the OS file system cache. Elasticsearch and Lucene leverage the OS file system cache heavily because of the immutable nature of Lucene segments. This means that small indices tend to be cached entirely in memory by the OS and effectively become diskless.
As an aside, it doesn't really make sense to benchmark Elasticsearch in a "cacheless" state. It is designed and built to operate in a cached environment - much of the performance that Elasticsearch is known for is due to its excellent use of caching.
To be completely accurate, your benchmark should really look at a system that has fully warmed the JVM (to properly size the young-generation (eden) space, optimize JIT output, etc.) and use real, production-like data to simulate "real world" cache filling and eviction at both the ES and OS levels.
Synthetic tests such as "no-cache environment" make little sense.
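That said, if you really do want a cold-cache measurement, clearing the ES caches alone won't get you there; you would also have to drop the OS page cache. On Linux that looks roughly like this (host and index name assumed):

curl -XPOST 'http://localhost:9200/index/_cache/clear'   # ES-level caches
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches       # Linux page cache plus dentries/inodes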
I don't know if this is what you're experiencing, but the cache isn't cleared immediately when you call clear cache; it is scheduled to be cleared within the next 60 seconds.
source: https://www.elastic.co/guide/en/elasticsearch/reference/1.5/indices-clearcache.html

Ways to improve first-time indexing in Elasticsearch

In my application, I need to re-index all of the data from time to time. I have noticed that indexing data the first time (via bulk index) is significantly slower than subsequent re-indexing. In one scenario, it takes about 2 hours to perform the indexing the first time, and about 15 minutes (indexing the same data) on subsequent runs.
While the 2 hours to index the first time is reasonable, I am curious why subsequent re-indexing runs are significantly faster. More to the point, I am wondering if there's anything I can do to improve performance when indexing the first time, e.g. perhaps by indicating how large the index will be.
Thanks,
Eric
Have you defined a mapping for your types? If not, every time ES finds a new field, the mapping must be updated (and this impacts the whole index).
On subsequent indexing, the mapping is already complete. So what you could do is explicitly map your types.
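A minimal explicit mapping supplied at index creation might look like this (index, type, and field names are invented for the example; types as in ES 1.x):

curl -XPUT 'http://localhost:9200/myindex' -d '{
  "mappings": {
    "mytype": {
      "properties": {
        "title":      { "type": "string" },
        "created_at": { "type": "date" }
      }
    }
  }
}'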
Also, you can improve the speed of re-indexing by setting the refresh_interval to a higher value; look at this benchmark.
Edited to strike out references to merge_factor as it has been removed in ES 2.0: https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking_20_setting_changes.html#_merge_and_merge_throttling_settings
As Damien indicates, you can indeed influence (bulk) indexing settings - refresh_interval can be set to -1 temporarily and set back to the default value of 1s after you complete your bulk indexing. Another setting to modify was merge.policy.merge_factor (set it to a higher value such as 30 and then back to the default of 10 once done), but see the note above about its removal in ES 2.0.
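In practice that toggle looks roughly like this against the update-settings API (index name is a placeholder):

curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
  "index": { "refresh_interval": "-1" }
}'
# ... run the bulk indexing ...
curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
  "index": { "refresh_interval": "1s" }
}'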
There are a number of tutorials and mailing list discussions about optimizing bulk indexing, but here's some official doc links to start with:
http://www.elasticsearch.org/guide/reference/index-modules/merge/
http://www.elasticsearch.org/guide/reference/api/admin-indices-update-settings/
If you haven't already tuned the memory settings for your JVM, you should. Although specific to a 512 MB VPS running Ubuntu 10.04 server, these settings (http://pastebin.com/mNUGQCLY) should point you in the right direction. Basically, allocating the desired amount of RAM to Elasticsearch upon startup can improve JVM memory allocation/GC timing.

How do I modify an existing MongoDB index?

I have a series of indexes in my MongoDB, and I think one of the reasons the system runs at such high CPU is that updating the indexes blocks. (It's an AWS micro instance running at 50%+ CPU during normal operation, and 99.9% during heavy write operations.)
I've got a good handful of indexes in place for fast queries, and now I'm thinking that I might be able to show some further improvements by moving index-building into a background operation.
I don't want to delete the indexes entirely (at least I don't think I do); instead I'd be happy if just "future operations" ran in the background.
I've looked at the Mongo index-building documentation (http://www.mongodb.org/display/DOCS/Indexes) and see the flags for turning on background operation (see below), but I don't see anything about how to modify an existing index.
db.things.ensureIndex({x:1}, {background:true});
db.things.ensureIndex({name:1}, {background:true, unique:true, dropDups:true});
I tried running the original index command a second time with the updated parameter, but when I re-execute the db.collection.getIndexes() command, it does not show the 'background' parameter in the output.
{
    "_id" : ObjectId("4de2c1a5c9907a4e77467826"),
    "ns" : "mydb.items",
    "key" : {
        "itemA" : 1,
        "itemB" : -1,
        "itemC" : -1
    },
    "name" : "itemA_1_itemB_-1_itemC_-1",
    "v" : 0
}
Do I have to drop the index and start again? Or is the index-in-background parameter simply not shown?
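For what it's worth, MongoDB won't alter the options on an existing index in place, so rebuilding it with background:true would mean dropping and re-creating it; a sketch reusing the names from the getIndexes() output above:

mongo mydb --eval '
  db.items.dropIndex("itemA_1_itemB_-1_itemC_-1");
  db.items.ensureIndex({itemA: 1, itemB: -1, itemC: -1}, {background: true});
'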
TTT, from the comments: You want async updating of the index after you insert, update, or delete a document?
I think so. I'd like the system's query speed to be maxed at all times, so I definitely need indexes, but there are also times when I'm running tens of thousands of inserts. This becomes a problem for the whole system, as Mongo's CPU usage shoots up to 99.9% for the duration of the update.
I think that updating indexes in the background is the answer, but I was reading in the Mongo docs that if an index is being recalculated, it isn't used for querying until the recalculation is done.
My ideal "dream situation" is that the system would use the "last best index" until the background update process was complete (or even just always using the current best known index).
Your index system as it is sounds pretty sane. The issue you are seeing looks like a symptom of the server setup you are using.
EC2 micro instances are notoriously poor performers when it comes to any sort of sustained operation. Serving web pages is fine, but any prolonged CPU usage will see output deteriorate over time. This is because the sustained CPU available is a lot smaller than the burst CPU available for shorter operations.
From Amazon's EC2 page:
Instances of this family provide a small amount of consistent CPU resources and allow you to burst CPU capacity when additional cycles are available. They are well suited for lower throughput applications and web sites that consume significant compute cycles periodically.
And:
Up to 2 EC2 Compute Units (for short periodic bursts)
I'd recommend moving to a setup with more available CPU, such as a small instance, or if cost is one of your primary concerns, a Linode VPS.

How to index records in Solr faster (and not impact the ColdFusion web server)? Two JVMs?

I have a 64-bit server with 8 GB RAM and dual quad-core CPUs. No resource ever hits 100% (except, I guess, the JVM -- right?).
I need to index several million records for Solr, but the machine is in production. I recognize having a second machine for indexing would be helpful.
Should I dedicate a second instance of the JVM, dedicated to Solr?
Right now, when I run an index, pages which are normally served in 200 milliseconds take about 1.5 seconds, sometimes more... even hitting the dreaded "Service is Unavailable" error.
I adjusted my JVM heap as follows:
-Xmx1024m
-XX:MaxPermSize=256m
In case I'm chasing the wrong solution, allow me to broaden the landscape a bit. It seems that I can't affect the indexing speed of Solr. I had previously been indexing about 150,000 records per hour on a dev server virtualized on a workstation. In a production environment with much more hardware available, I'm indexing at exactly the same speed.
Without data to prove it, I think my JVM adjustments did not speed up the indexing, although they may have allowed the CF server to continue serving pages. I must say the indexing speed is terribly slow, but I do know that it's not a function of the data access layer. I rewrote it from pure ORM to objects backed by SQL stored procedures thinking that was the slowdown (no effect).
Use a separate instance for indexing. The only trick is getting the running search instance to re-read the updated index; for that, you set up a master (the indexer) and a slave (the searcher) and do replication. This both keeps the searcher from being interrupted and gives the indexer its own JVM, including its own share of the resources.
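Once the ReplicationHandler is configured in solrconfig.xml on both sides, the searcher can be told to pull a freshly built index on demand via Solr's replication API (host and path here are assumptions):

curl 'http://searcher-host:8983/solr/replication?command=fetchindex'   # pull the new index from the master
curl 'http://searcher-host:8983/solr/replication?command=details'      # check replication status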
Have you tried these optimization tips?
http://bloggeraroundthecorner.blogspot.com/2009/08/tuning-coldfusion-solr-part-1.html
http://bloggeraroundthecorner.blogspot.com/2009/08/tuning-coldfusion-solr-part-2.html
http://bytestopshere.com/post.cfm/lessons-learned-moving-from-verity-to-solr-part-1
