Dedicated Nodes in Elasticsearch for Special Client? - elasticsearch

In our application we have an elasticsearch cluster that is shared for all of our clients.
The types of requests made against this cluster are computationally intense and may take minutes to complete. Because of the type of data in the cluster, the types of requests we get, and the irregularity at which they are used, we can't predict when these requests will be made or do any caching before hand. If multiple clients make requests at the same time, they will experience slower response speeds than normal.
For most clients this isn't an issue. Their data isn't large enough to make a noticeable difference (3s -> 10s occasionally isn't a big deal). But for larger clients, the time difference may be minutes and is very noticeable.
What we'd like, above all else, is consistency- even if these operations were slower on average. To do that, we'd like to give these special clients dedicated nodes, while all other clients use the shared nodes. At a first glance, it seems the only way to do that is to create a dedicated cluster.
But this adds overhead in any application that interfaces with elasticsearch to first do a lookup for the cluster to route to. Ideally, we'd be able to create dedicated nodes within the cluster. This way, the application doesn't need to be aware of cluster routing, and the index configuration can be shared between these "virtual clusters".
Each document has a client id, which we can use to distribute across nodes using _routing. But this has its own problems. Firstly, this doesn't allow us to create a generic default cluster. Secondly, this may mean the dedicated nodes share client data with other clients- the goal is to get consistent speeds by removing node resource contention. And finally, this doesn't allow us to allocate multiple nodes for a given route.
Is there a way to create a routing rule that nodes can be added to explicitly. e.g. I want to add 3 nodes to routing key 582123.
Is there a way to create a default routing rule for nodes that don't match existing routes? If not we could always have an explicit default route. We'll still need to do a route lookup application side but it would still reduce complexity in a multi-cluster scenario.

Depending on the version of elasticsearch you are using and the setup of your indices, you could use per-index allocation. Basically you give a node an attribute and then you specify on index level settings where that index (more accurately, its shards) should end up. As you will read in the documentation, you need to make sure that other constraints are not being violated e.g.
Shards are only relocated if it is possible to do so without breaking another routing constraint, such as never allocating a primary and replica shard on the same node.
This means that you need to have different indices though.

Related

Reason behind “Index creation no longer defaults to five shard”

What was the reasoning behind ""Index creation no longer defaults to five shard but one shard"
So far, the assumption was , more shards = more scalability = more parallelism
Isnt that change defeating the whole purpose of distributed systems like ES ?
Yeah, you can relate to more shards= more scalability = more parallelism but this only happens when this is only useful when these shards utilize the multi-cores or more machines(data-nodes) in the cluster.
This is the default config, which is created for the basic workloads and obviously needs more fine-tuning for the advance use cases, which is the sole purpose of making it extensible, it's very difficult to design the perfect Elasticsearch cluster and as it depends on various factors, Elasticsearch tends to provides some default values which works more for general use-cases.
Either you start with a modest workload and then gradually your workload tends to increase, or you start with the huge workload in the begining itself(in which case, any way you will have more shards to get the benefit listed in the first line and this is for advanced use-case).
But first use is more common and the beauty of Elasticsearch is that with little knowledge you can get started and these default settings work quite well for modest workload and oftentimes you don't have to change them and even don't have to understand them in details.
Having more number of shards for a small number of documents with huge search traffic created issues(creation of 5 threads for a single search as default shards were 5) and this is the common use for most of the basic and modest applications out there.
So it makes sense to change the default shards to 1 as its more common use-case and beyond that any way you need to go deep to scale your cluster which would require fine-tuning Elasticsearch further.

ElasticSearch Analytical queries

I am evaluating a few different options for powering an analytics application using an open-source technology. One of the options is using ElasticSearch, though I haven't been able to find any examples of companies using it for large-scale implementations of analytics, thus my question here.
For datasets of 1B-10B points, what limitations (if any, or would it be possible?) would ElasticSearch have? For example, in having a feature-set like Google Analytics, with it.
Here's one user who seems to do analytics on largeish amounts of data - https://digitalgov.gov/2015/01/07/elk - plus description of what they do including downsides.
With Elasticsearch there is no black-white answer to a question as open-ended as yours. The amount of records is not everything: how much disk space are we talking about, how many nodes, how many indices, the number of shards for each, what kind of analytics you need, hardware specs etc etc. Two things are certain from the data you mentioned: you need dedicated master nodes and more importantly good client nodes and depending on queries and the concurrent searches count you will need more or less of them.
In Elasticsearch 5 the client node is called coordinating node but it has the same role. One limitation I can think of is the heap/RAM memory of such coordinating node. The heap of an Elasticsearch node shouldn't be set to values larger than ~30GB due to the longer garbage collection cycles of the JVM (larger memory to clean, more time it takes, more unusable the node is). During GC nothing else runs on that JVM. So you could be limited by the size of the memory.
I said that you most likely will need coordinating nodes because heavy aggregations (what will probably be the most used feature in an analytics platform) will use cpu and memory in the final phase of a query where it gathers the results from all shards involved and performs a final sorting and aggregation. Thus it will need more memory than a normal data node would only for aggregations.
I doubt though that a single aggregation will use so many GBs of memory but it could theoretically use it if the query/aggregation being used is built in a reckless way. Depending on how many concurrent searches there are and how much memory they use you might need more or less coordinating nodes so that the GC cycles are not very frequent.
Bottom line: I think this is possible but some common sense is needed (see my comment about reckless aggregations) and some as close to reality as possible estimations regarding the load.
Google Analytics Pros:
Easy to Install
Can be used in multiple environments (e.g. web, mobile, other)
Customized data collection
Google Analytics Cons:
Custom reporting is limited
Upgrading to Premium is expensive
Requires continual traning
Slices data into smaller samples to deal with large sampling issues
ElasticSearch Pros:
Distributed by design
Easier to scale horizontally
Good at full text search
Fast indexing & querying
ElasticSearch Cons:
Not a relational database therefore does not benefit from things like foreign-key constaints
Data consistency can be affected
No built-in authentication or authorization system

Infinispan vs memcached for high concurrency need

My web application maintains in memory cache of domain entities which are read/written at high frequency. To make application clustered, i need to synchronize / externalize this cache.
Which will be better option amongst memcached and infinispan considering following application facts-
cache will be read/written at high frequency per second
if infinispan, data need to replicated across nodes near- real time
high concurrent write should not create conflicts issue if replication is slow.
I feel memcached will solve this purpose well since it's centralized and does not need replication delay like infinispan. Can experts provide opinion on this?
Unfortunately I'm not a Memcached expert but let me tell you more about some fundamental concepts so that you could pick the best option for your use case...
First, centralized vs decentralized - if you have only one node in your system, it will be faster (as you said there is no replication). However what will happen if the node is down? Or another scenario - what will happen if the node gets full (as you said you will perform a lot of read/writes per second)? One solution for that is to use master/slave replication where writes are propagated to the slave node asynchronously. This solution will save you in case the node is down but won't do any good if the node is full (if master node is full, slave will get full a couple of minutes later).
Data consistency - if you have more than 1 node in your system, your data might get out of sync. Imagine asynchronous replication between 2 nodes and a client connected to each of them. Both clients perform a write to the same key at the same exact moment. It might seems unlikely but believe me, with highly concurrent reads and writes it will happen. The only way to solve this problem is to use synchronous replication with majority of nodes up and running (or with so called consensus).
Back to your scenario - if a broken node is not a problem for you (for example, you can switch to some other data source automatically) and your data won't grow - go ahead for 1 node solution or master/slave replication. If your data need to be strongly consistent - make sure you're doing sync replication (and possibly with transactions but you need to refer to the user manual for guidance). Otherwise I would recommend picking a more versatile solution which will allow you to add/remove nodes without taking down whole system and will have an option for sync/async replication.
From my experience, people care too much about data consistency whereas should care much more about scalability. And a final piece of advice - please define your performance criteria before evaluating any solution (something like, my writes need to take no longer than X and reads no longer than Y. Define also confidence level for your criteria (I need 99.5% of all reads to be less than X).

optimize elasticsearch / JVM

I have a website for classified. For this I'm using elasticsearch, postgres and rails on a same ubuntu 14.04 dedicated server, with 256GB of RAM and 20 cores, 40 threads.
I have 10 indexes on elasticsearch, each have default numbers of shards (5). They have between 1000 and 400 000 classifieds depending on which index.
approximately 5000 requests per minute, 2/3 making an elasticsearch request.
according to htop, jvm is using around 500% of CPU
I try different options, I reduce number of shards per index, I also try to change JAVA_OPTS as followed
#JAVA_OPTS="$JAVA_OPTS -XX:+UseParNewGC"
#JAVA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC"
#JAVA_OPTS="$JAVA_OPTS -XX:CMSInitiatingOccupancyFraction=75"
#JAVA_OPTS="$JAVA_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"
but it doesn't seems to change anything.
so to questions :
when you change any setting on elasticsearch, and then restart, should the improvement (if any) be visible immediately or can it arrive a bit later thanks to cache or anything else ?
can any one help me to find good configuration for JVM / elasticsearch so it will not take that many resources
First, it's a horrible idea to run your web server, database and Elasticsearch server all on the same box. Each of these should be given it's own box, at least. In the case of Elasticsearch, it's actually recommended to have at least 3 servers, or nodes. That way you end up with a load balanced cluster that won't run into split-brain issues.
Further, sharding only makes sense in a cluster. If you only have one node, then all the shards reside on the same node. This causes two performance problems. First, you get the hit that sharding always adds. For every query, Elasticsearch must query each shard individually (each is a separate Lucene index). Then, it must combine and process the result from all the shards to produce the final result. That's a not insignificant amount of overhead. Second, because all the shards reside on the same node, you're I/O-locked. The shards have to be queried one at a time instead of all at once. Optimally, you should have one shard per node, however, since you can't create more shards without reindexing, it's common to have a few extra hanging around for future horizontal scaling. In that scenario, the cost of reindexing what could be 100's of gigs of data or more outweighs a little bit of performance bottleneck. However, if you've got 5 shards running one node, that's probably a large part of your performance problems right there.
Finally, and again, with Elasticsearch in particular, swapping is a huge no-no. Most of what makes Elasticsearch efficient is it's cache which all resides in RAM. If swaps occur, it jacks with the cache in sometimes unpredictable ways. As result, it's recommended to turn off swapping completely on the box your node(s) run on, and set Elasticsearch/JVM to have a min and max memory consumption of roughly half the available RAM of the box. That's virtually impossible to achieve if you have other things running on it like a web server or database. Databases in particular aggressively consume RAM in order to increase throughput, which is why those should likewise reside on their own servers.

Bulk insert vs Single insert

The primary dev managing our ES cluster has made the statement that single document loads to ES will only provide us with roughly 30 / 40 creations a second. Whereas the bulk operations will give us more in the range of a 1,000+. I realize that bulk is always faster (or is generally) and there are hardware / environment constraints to any process. However, with other technologies you do not pay such a heavy price for single insertions. I am obviously ignorant when it comes to ES. Why do you pay such a heavy price for document writes in ES? Or are we just not properly informed?
Environment:
Apache Storm writes to our ES cluster
Currently all of the writes are processed in bulk operations.
What you have to take into account is the round trip time between your loader and your cluster. Setting up an http connection, transferring the data, and then waiting for a response can take a while -- in this case it seems it's taking your about 30 ms. Elasticsearch has to setup a parser for your request, hand it off to the node that is really going to do the work, and then generate the response back to you.
By using the bulk API, you remove a lot of back and forth -- ES can group together inserts going to the same node, doesn't have to instantiate a new parser for every request, etc.
HTTP Connection pooling for single requests would help, but doing bulk inserts/updates/deletes is always going to be faster in the long run.
Bulk indexing is indeed way faster but it is not as bad as you system admin suggests. Elasticsearch has gotten a lot better at this stuff over the past two years.
We're able to do hundreds of inserts/updates per second without bulking requests. Most inserts take around 1 ms (including sending the http request and receiving the response). If insert speed becomes an issue, you can back off on the cluster refresh (default 1s). Also, you can use multiple threads to insert. Bulk insert can get in the range of 10000s, depending on how complex your mappings are.
You definitely want http connection pooling (true when using any kind of webservice in anger) or even better, run an embedded elasticsearch node. Another alternative is to run an elasticsearch node on localhost if you don't want to do an embedded node. That way, all http traffic is on localhost.
Finally, if you need to support more concurrent writes, you can always increase the number of shards and nodes. These numbers are not set in stone. If you need tens of thousands of writes per second, it should be possible to engineer a cluster that can do it. It will require a lot of tuning and hardware probably, and you should probably not do this unless you have a really good reason to do so. However, the whole point of elastic search is horizontal scalability.

Resources