Elasticsearch with different Java heaps, does it matter? - performance

Say I've got 2 servers, one of which has -Xmx and -Xms set to 4G and the other to 2G.
Will Elasticsearch account for that performance difference when balancing? Or will both servers be hit purely based on the indices they hold, making an OOM much more likely on the latter than on the former?
By the way, I've set the properties indices.fielddata.cache.size, indices.breaker.fielddata.limit, indices.breaker.request.limit, and indices.breaker.total.limit on both servers, as Elasticsearch suggests.
This is important to me because, if the heap difference does matter, I'd have to change the index sharding based on guessed index strain, which would be a hassle (if not impossible).
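For reference, a sketch of what setting those properties can look like (the values below are purely illustrative; indices.fielddata.cache.size is a static node setting, while the breaker limits can also be changed at runtime):

# Static node setting, goes into elasticsearch.yml on each node:
#   indices.fielddata.cache.size: 30%
# The breaker limits are dynamic cluster settings:
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "indices.breaker.fielddata.limit": "40%",
    "indices.breaker.request.limit": "40%",
    "indices.breaker.total.limit": "70%"
  }
}'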

Elasticsearch treats every node the same and balances shards (and therefore documents) equally between them. This means Elasticsearch won't readjust based on hardware, so a mixed cluster won't give you optimal performance automatically.
One thing to remember here is that a herd of bulls is only as fast as its slowest bull; the same applies here. That said, if the load is small enough that it does not exhaust the 2 GB machine, you should not see any issue. Otherwise you will see a difference on memory-hungry operations like aggregations.
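If the smaller node does become the bottleneck, one workaround (just a sketch; my_index is a placeholder name) is to verify the heap each node was actually started with and to cap how many shards of a memory-hungry index may land on any single node:

# Compare the heap each node actually got
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.max,heap.percent,ram.max'

# Limit how many shards of a heavy index a single node may hold,
# so the 2 GB node doesn't carry more than it can handle
curl -XPUT 'localhost:9200/my_index/_settings' -H 'Content-Type: application/json' -d '
{
  "index.routing.allocation.total_shards_per_node": 1
}'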

Related

Reason behind “Index creation no longer defaults to five shards”

What was the reasoning behind "Index creation no longer defaults to five shards but one shard"?
So far, the assumption was: more shards = more scalability = more parallelism.
Isn't that change defeating the whole purpose of distributed systems like ES?
Yes, more shards = more scalability = more parallelism, but that only holds when those shards actually make use of multiple cores or multiple machines (data nodes) in the cluster.
This is the default config, aimed at basic workloads, and it obviously needs more fine-tuning for advanced use cases; that's exactly why it's configurable. It's very difficult to design the perfect Elasticsearch cluster, and since it depends on many factors, Elasticsearch ships default values that work for the more general use cases.
Either you start with a modest workload that gradually increases, or you start with a huge workload right from the beginning (in which case you'll have more shards anyway, to get the benefits listed in the first line; that's the advanced use case).
The first case is more common, and the beauty of Elasticsearch is that you can get started with little knowledge: the default settings work quite well for modest workloads, and often you don't need to change them, or even understand them in detail.
Having many shards for a small number of documents with heavy search traffic caused issues (a single search fanned out into 5 shard-level requests because the default was 5 shards), and that is the common situation for most basic, modest applications out there.
So it makes sense to change the default to 1 shard, since that fits the more common use case; beyond that, you need to dig deeper to scale your cluster anyway, which requires further fine-tuning of Elasticsearch.
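If you do need the extra parallelism, you can still set the shard count explicitly when creating an index instead of relying on the default (a minimal sketch; the index name and counts are placeholders):

curl -XPUT 'localhost:9200/my_index' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}'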

Can circuit breaker exceptions be avoided using horizontal scaling?

I am using Crate 1.0.2, which internally uses Elasticsearch, so my question applies to both. For certain queries I get a circuit breaker exception:
CircuitBreakingException: [parent] Data too large, data for [collect: 0] would be larger than limit of [11946544332/11.1gb]
These queries are mainly GROUP BY on multiple columns. I have billions of documents indexed and 16 GB of RAM allocated as the Crate heap size, and I have multiple such nodes connected together in a cluster. Will adding more nodes to the cluster help get rid of this error, and will my queries then run fine? Or must I increase the heap to 30 GB? My worry is that if I increase it to 30 GB and keep adding data, some day that query will hit the circuit breaker again. So I wanted to solve it by scaling horizontally, i.e. adding more nodes. Would that be the wiser decision?
Short answer: Usually horizontal scaling helps.
Your error seems to be caused by group by queries.
The GROUP BY operations are executed in a distributed fashion. So more nodes
will generally split the load and therefore also the memory usage. (Make sure
there are enough shards so that they're spread among all nodes)
There is a catch though: Eventually the data needs to be merged together on the
node you sent the initial query to. This is generally fine because the data
arrives pre-aggregated, but if the cardinality is too high (e.g. GROUP BY on a
primary key), the whole data set has to fit into memory on this coordinator
node.
If your nodes have enough memory to go up to 30 GB (with still having enough to
spare for the file system cache), I'd personally tend to increase the HEAP size
first, before adding new nodes.
Update:
Recent versions (2.1.X) also contain some fixes regarding the circuit-breaker behaviour, so if it's possible to update, that'd be recommended as well.
Update2:
Note that there are different cases in which a circuit breaker can trip. In
your case it's caused by a GROUP BY using up quite a lot of memory. But it can
also be tripped if a single request is too large. For example if the bulk size
is too large. In such a case more nodes wouldn't help. You'd have to reduce the
bulk size.
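To see which breaker is actually tripping and how close each one is to its limit, you can look at the per-node breaker statistics (shown here against the plain Elasticsearch REST API; since Crate embeds Elasticsearch the same breakers exist there, though how you query them may differ):

# Estimated size, configured limit and trip count for every circuit breaker, per node
curl -s 'localhost:9200/_nodes/stats/breaker?pretty'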

Elasticsearch shard allocation for small indices

I have an Elasticsearch setup with 192 active indices ranging from a few hundred MB up to possibly 5 GB each. I read that for a Logstash use case with 1 GB indices you should only use 1 shard. The difference with my setup is that I will have more users (an estimate of up to 100) expecting a quick response time. I intend to have 1 replica for reliability.
Will having 1 shard per index still be appropriate for my use case?
In a word: yes.
The need to create multiple primary shards derives from the need to isolate documents, from extreme counts (e.g., when you're into the billions of documents), or from the need to improve write throughput (writing documents across more places, thereby reducing the individual burden).
In practice, you want to shard based on your use case, unless you're one of those first two scenarios (isolation or extreme counts).
Are you read heavy?
Are you write heavy? (Less common, but it does happen)
If you're read heavy, as most use cases are, then having fewer shards will help you by limiting the request size (fewer places to look). Given that your shard sizes are also relatively small (I'd consider anything under 5 GB to be relatively small), you can easily get away with having a single primary shard and it should benefit your search performance by doing so.
Indexes that share the same mappings, but are also tiny ("few hundred MBs"), should likely be combined if you search across them. If they're independent, then it really makes no difference and the isolation sounds like good practice at the expense of slightly bloating your cluster state (with each index).
Have a look at this blog: https://qbox.io/blog/optimizing-elasticsearch-how-many-shards-per-index. It has a lot of good pointers on sharding and shard sizing.
However, the question you really should be asking yourself is: how easy is it to change? When it comes to sizing and scalability, the answer is often "it depends", and the real question is: how quickly can you reconfigure?
This could, for example, mean designing your application in a way that allows quick re-spooling of data into a new index, using aliases so that you can in fact change these things, and knowing where your data lives (not just in Elastic, I hope), etc.
Building the system, from the start, so that you can quickly rebuild indices lets you experiment with sizes and, more importantly, change them as your needs change.
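As a sketch of that idea (index and alias names are made up, and this assumes a version with the _reindex API, i.e. 5.0+): point your application at an alias, reindex into a new index with a different shard count, then swap the alias atomically.

# Create the new index with the shard count you want to try
curl -XPUT 'localhost:9200/products_v2' -H 'Content-Type: application/json' -d '
{ "settings": { "number_of_shards": 1, "number_of_replicas": 1 } }'

# Copy the data over
curl -XPOST 'localhost:9200/_reindex' -H 'Content-Type: application/json' -d '
{ "source": { "index": "products_v1" }, "dest": { "index": "products_v2" } }'

# Swap the alias the application actually queries
curl -XPOST 'localhost:9200/_aliases' -H 'Content-Type: application/json' -d '
{
  "actions": [
    { "remove": { "index": "products_v1", "alias": "products" } },
    { "add": { "index": "products_v2", "alias": "products" } }
  ]
}'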

optimize elasticsearch / JVM

I have a website for classified ads. For this I'm using Elasticsearch, Postgres and Rails on the same Ubuntu 14.04 dedicated server, with 256 GB of RAM and 20 cores / 40 threads.
I have 10 indices on Elasticsearch, each with the default number of shards (5). They hold between 1,000 and 400,000 classifieds depending on the index.
I get approximately 5,000 requests per minute, 2/3 of which make an Elasticsearch request.
According to htop, the JVM is using around 500% of CPU.
I've tried different options: I reduced the number of shards per index, and I also tried changing JAVA_OPTS as follows
#JAVA_OPTS="$JAVA_OPTS -XX:+UseParNewGC"
#JAVA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC"
#JAVA_OPTS="$JAVA_OPTS -XX:CMSInitiatingOccupancyFraction=75"
#JAVA_OPTS="$JAVA_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"
but it doesn't seem to change anything.
So, two questions:
When you change any setting on Elasticsearch and then restart, should the improvement (if any) be visible immediately, or can it show up a bit later thanks to caching or anything else?
Can anyone help me find a good configuration for the JVM / Elasticsearch so that it doesn't take up that many resources?
First, it's a horrible idea to run your web server, database and Elasticsearch server all on the same box. Each of these should be given its own box, at least. In the case of Elasticsearch, it's actually recommended to have at least 3 servers, or nodes. That way you end up with a load-balanced cluster that won't run into split-brain issues.
Further, sharding only makes sense in a cluster. If you only have one node, then all the shards reside on that same node. This causes two performance problems. First, you take the hit that sharding always adds: for every query, Elasticsearch must query each shard individually (each is a separate Lucene index), then combine and process the results from all the shards to produce the final result. That's not an insignificant amount of overhead. Second, because all the shards reside on the same node, you're I/O-locked: the shards have to be queried one at a time instead of all at once. Optimally you should have one shard per node; however, since you can't create more shards without reindexing, it's common to have a few extra hanging around for future horizontal scaling. In that scenario, the cost of reindexing what could be hundreds of gigs of data or more outweighs a small performance bottleneck. However, if you've got 5 shards running on one node, that's probably a large part of your performance problems right there.
Finally, and again with Elasticsearch in particular, swapping is a huge no-no. Most of what makes Elasticsearch efficient is its caching, which all resides in RAM. If swapping occurs, it messes with the cache in sometimes unpredictable ways. As a result, it's recommended to turn off swapping completely on the box your node(s) run on, and to set Elasticsearch/the JVM to a min and max memory consumption of roughly half the available RAM of the box. That's virtually impossible to achieve if you have other things running on it, like a web server or database. Databases in particular aggressively consume RAM in order to increase throughput, which is why those should likewise reside on their own servers.
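A minimal sketch of that advice for a box dedicated to Elasticsearch (sizes are illustrative; how the heap is set depends on your version):

# Disable swap entirely (or at least set vm.swappiness to 1)
sudo swapoff -a

# In elasticsearch.yml, lock the heap into RAM so it can never be swapped out:
#   bootstrap.memory_lock: true     (bootstrap.mlockall in older 1.x/2.x releases)

# Give the JVM roughly half the machine's RAM, min == max,
# and stay below ~31g so compressed object pointers remain enabled
export ES_HEAP_SIZE=31g             # pre-5.x; on 5.x+ set -Xms/-Xmx in jvm.options instead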

Elasticsearch fuzzy matching optimization for huge server/server cluster

I've got an index with quite complex queries running on it. The main slowdown is the fuzzy queries, which are run against a field containing 2-5 words per record. I mainly have to find rows that differ by 1-3 characters.
On my machine with 4 cores (with HT) and 8 GB of RAM, my queries execute in about 1-2 s each.
On a server with 12 cores (with HT) and 72 GB of RAM, the query executes in 0.3-0.5 seconds. That doesn't seem to me like reasonable scaling for the hardware provided. I'm sure there must be some hidden options I can tune to adjust the query performance.
I've looked through the Elasticsearch guide but couldn't find anything that would help me tune performance based on the number of CPUs or the amount of RAM, or tune Elasticsearch specifically for fuzzy queries.
Another question: how does it scale if I add another server like this? Will the query time be roughly halved?
There are a couple of possibilities here. The first is that your query is I/O bound. In this case, just adding another server might help, because two nodes will be retrieving data from two disks. Another possibility is that your query is CPU bound. To a large degree, search against a single shard is a single-threaded process. Assuming that your index was created with the default settings, it has 5 shards, so your query cannot significantly benefit from running on more than 5 CPUs. In this case, adding another node would only slow things down because of network overhead. Instead, you would need to recreate the index with more shards.
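Independently of sharding, fuzzy matching itself can often be made cheaper by bounding the term expansion. A rough sketch (index and field names are placeholders, and the right values depend on your data):

curl -XPOST 'localhost:9200/my_index/_search' -H 'Content-Type: application/json' -d '
{
  "query": {
    "match": {
      "title": {
        "query": "some words",
        "fuzziness": "AUTO",
        "prefix_length": 2,
        "max_expansions": 20
      }
    }
  }
}'

prefix_length forces the first characters to match exactly, which drastically cuts the number of candidate terms, and max_expansions caps how many fuzzy variants are considered per term.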

Resources