Elasticsearch: What is the effect of disabling replication and balancing - elasticsearch

I have an ES cluster and an application that indexes data into ES.
EDIT: The application creates indices in a dynamic way based on some business rules.
For example, if the application listens to tweets from the Twitter API based on some hashtags, it creates an index in ES for each hashtag.
This way, each time a new hashtag comes, a new index is created in ES.
Sometimes shard reallocation happens, and at this stage the cluster behaves poorly because the amount of data moved between nodes is huge.
Through the ES cluster API, we can disable shard reallocation and balancing.
What will be the effects (positive and negative) of disabling the reallocation and balancing?

This sounds like quite an unorthodox way of organizing documents in Elasticsearch. Wouldn't it be simpler to have a not_analyzed string field holding an array of hashtags (as a single tweet can have zero, one, two or more hashtags)?
If there were only one hashtag per tweet, you could use it as a routing value to send each tweet to a specific shard, if search performance is a concern for you.
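A minimal sketch of the single-index layout suggested above. The field name `hashtags` and the helper names are hypothetical. The legacy branch uses the pre-5.x `string` / `not_analyzed` form the answer refers to; on Elasticsearch 5.x and later the equivalent field type is `keyword`. The code only builds request bodies and does not contact a cluster.

```python
import json


def hashtag_mapping(legacy=True):
    """Build the `properties` section of a mapping with an exact-match field.

    legacy=True  -> ES 1.x/2.x style: string + not_analyzed
    legacy=False -> ES 5.x+ style: keyword
    """
    if legacy:
        field = {"type": "string", "index": "not_analyzed"}
    else:
        field = {"type": "keyword"}
    # Any field can hold several values, so one tweet can carry
    # zero, one, or many hashtags without a dedicated array type.
    return {"properties": {"hashtags": field}}


def tweet_document(text, hashtags):
    """Build one tweet document; `hashtags` is a plain list of strings."""
    return {"text": text, "hashtags": list(hashtags)}


def hashtag_query(tag):
    """Exact-match term query for tweets carrying one given hashtag."""
    return {"query": {"term": {"hashtags": tag}}}


if __name__ == "__main__":
    print(json.dumps(hashtag_mapping(), indent=2))
    print(json.dumps(tweet_document("hello", ["es", "search"])))
    print(json.dumps(hashtag_query("es")))
```

With this layout a new hashtag is just a new value in an existing field, so no index (and no new shards) has to be created when one appears.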
Anyway, if you disable shard balancing, some machines will end up holding a disproportionately large share of the documents and others too few, which could hamper indexing and search performance.
Also, if you don't have any replicas of your shards, then when a node shuts down, part of your data becomes inaccessible. I'm sure there are other downsides in the long run as well.
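For reference, the cluster API mentioned in the question controls these behaviours through two cluster settings, `cluster.routing.allocation.enable` and `cluster.routing.rebalance.enable`. The sketch below only builds the JSON body for a `PUT _cluster/settings` request; the function name and its defaults are hypothetical, and nothing is sent to a cluster.

```python
import json

# Values accepted by the two settings.
ALLOCATION_VALUES = {"all", "primaries", "new_primaries", "none"}
REBALANCE_VALUES = {"all", "primaries", "replicas", "none"}


def routing_settings(allocation="all", rebalance="all", scope="persistent"):
    """Build a body for PUT _cluster/settings.

    allocation -> which shards may be assigned to nodes at all
    rebalance  -> which shards may be moved to even out the cluster
    scope      -> "persistent" or "transient"
    """
    if allocation not in ALLOCATION_VALUES:
        raise ValueError(f"unknown allocation value: {allocation!r}")
    if rebalance not in REBALANCE_VALUES:
        raise ValueError(f"unknown rebalance value: {rebalance!r}")
    if scope not in {"persistent", "transient"}:
        raise ValueError(f"unknown scope: {scope!r}")
    return {
        scope: {
            "cluster.routing.allocation.enable": allocation,
            "cluster.routing.rebalance.enable": rebalance,
        }
    }


if __name__ == "__main__":
    # Keep allocating shards of new indices, but stop moving existing ones.
    print(json.dumps(routing_settings(rebalance="none"), indent=2))
```

Note that with `allocation="none"` no shards are assigned at all, including those of newly created indices, which matters for an application that creates indices on the fly. Setting only `rebalance="none"` leaves allocation of new shards untouched and just stops existing shards from being moved for balance.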

Related

What are the general guidelines for Elasticsearch cluster configuration for instance size, data nodes and sharding?

We regularly encounter several issues with Elasticsearch. They seem to be as follows:
Out of disk space
Slow query evaluation time
Slow/throttled data write times
Timeouts on queries
There are various areas of an Elasticsearch cluster that can be configured:
Cluster disk space
Instance type/size
Number of data nodes
Sharding
It can sometimes be confusing which areas of the cluster you should be tuning depending on the problems outlined above.
Increasing the ES cluster total disk space is easy enough. Boosting the ES instance type seems to help when we experience slow data write times and slow query response times. Implementing sharding seems to be best when one particular ES index is extremely large. But it's never quite clear when we should boost the number of data nodes vs boosting the instance size.

How many indexes can I create in elastic search?

I am very new to Elasticsearch and its applications. I found that Elasticsearch saves data (indexes) onto disk. Then I wondered: are there any limits on the number of indexes that can be created, or can I create as many as I like since I have a very large amount of disk space?
Currently I have Elasticsearch deployed as a single-node cluster with Docker. I have read something about shards and their limitations, but I was not able to understand it properly.
Is there anyone on SO, who can shed some light onto these questions for a newbie in layman terms?
What is a single-node cluster and how does my data get saved onto disk? Also, what are shards and how are they related to Elasticsearch?
I guess the best answer is "it depends". Generally there is no limit on the number of indexes. Every index has its own mapping and is independent of other indexes by default. Indexes are not the data itself; you can think of each one as an entire database on its own (under the hood, each shard of an index is a separate Lucene index). Many variables affect the answer. For example, if you plan to replicate the shards of an index, you may hit limits because of the size of the documents you plan to ingest into it.
As another note, you may first need to ask why you need many indexes. Is it to speed up search or to increase query throughput? If so, it is perhaps better to use replica shards alongside the primary shards in a single index, because queries can run in parallel on the replica shards. You can think of a shard as a standalone index inside your main index. In conclusion, there is no limit as long as you have enough free disk space to store new data (the inverted index built for each field grows as you add documents), but depending on your needs it may be better to have primary and replica shards inside one index.
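To make the primary/replica arithmetic concrete, here is a small sketch. The function names are hypothetical; `number_of_shards` and `number_of_replicas` are the real index settings that would go into the body of an index creation request. Nothing is sent to a cluster.

```python
def index_settings(primaries, replicas):
    """Build a create-index body with explicit shard and replica counts."""
    if primaries < 1:
        raise ValueError("an index needs at least one primary shard")
    if replicas < 0:
        raise ValueError("replica count cannot be negative")
    return {
        "settings": {
            "number_of_shards": primaries,
            "number_of_replicas": replicas,
        }
    }


def total_shards(index_count, primaries, replicas):
    """Total shard copies the cluster must hold.

    Each primary has `replicas` copies, so one index contributes
    primaries * (1 + replicas) shards.
    """
    return index_count * primaries * (1 + replicas)


if __name__ == "__main__":
    print(index_settings(primaries=1, replicas=1))
    # Many small indices versus one index with more shards.
    print(total_shards(index_count=1000, primaries=1, replicas=1))
    print(total_shards(index_count=1, primaries=5, replicas=1))
```

The point of the comparison is that the number of indexes is not free: every index brings at least one shard, and each shard is a full index of its own that the cluster has to track.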

Overhead of empty elastic search indices on performance

We use Elastic search for full text search use cases. The data is metadata collected across different objects and stored as ES document. We also update the document in ES whenever the master data gets updated. So, basically it is not a logging use case.
We create one ES index (one primary and 1 replica shard) as soon as we have a tenant who gets onboard for our application. This is to ensure that the ES index is ready when the first object gets created.
We do not anticipate a large volume of data in each index. The data could be up to a few hundred MB per index, so this is a relatively empty index.
Also, full text search is an optional add-in feature in application, so not all tenants may opt for the same, however our technical team suggested creating index upfront.
What is the overhead of such indices on the ES performance? Are we doing anything different from best practices of ES?
Any input is appreciated.
An empty Elasticsearch index doesn't have much overhead, as there is no data in it. The only place where an empty index takes up space is the cluster state (index mappings, settings, etc.), which every node in the cluster holds. Any change to index metadata (mappings or settings) updates the cluster state, and the update is propagated to all nodes in the cluster.
If you have sufficient memory and ES heap size, you don't have to worry at all about these empty indices, which IMO make sense considering your use case.

Elasticsearch maximum index count limit

Is there any limit on how many indexes we can create in elastic search?
Can 100 000 indexes be created in Elasticsearch?
I have read that a maximum of 600-1000 indices can be created. Can it be scaled?
E.g.: I have a number of stores, and each store has items. Each store will have its own index where its items will be indexed.
There is no limit as such, but you obviously don't want to create too many indices ("too many" depends on your cluster, nodes, size of indices, etc.). In general it's not advisable, as it can have a severe impact on cluster functioning and performance.
Please check Loggly's blog. Their first point is about proper provisioning, and the relevant text from that blog is quoted below.
ES makes it very easy to create a lot of indices and lots and lots of
shards, but it’s important to understand that each index and shard
comes at a cost. If you have too many indices or shards, the
management load alone can degrade your ES cluster performance,
potentially to the point of making it unusable. We’re focusing on
management load here, but running too many indices/shards can also
have pretty significant impacts on your indexing and search
performance.
The biggest factor we’ve found to impact management overhead is the
size of the Cluster State, which contains all of the mappings for
every index in the cluster. At one point, we had a single cluster with
a Cluster State size of over 900MB! The cluster was alive but not
usable.
Edit: Thanks to @Silas, who pointed out that from ES 2.x, cluster state updates are much less costly (as only the diff is sent in the update call). More info on this change can be found in this ES issue

How to segregate Elasticsearch index and search path as much as possible

I am planning to segregate Elasticsearch index and search requests as much as possible to avoid any unnecessary delay in the indexing process. There is no such thing as a dedicated Elasticsearch search node or index node. However, I was wondering if the following scenario is suitable. As far as I understand, I cannot segregate search requests from index requests completely, because in the end both hit ES data nodes, but here is what I think can help a little:
A few Elasticsearch coordinator nodes (no master/data role) to deal with search requests and route them to the corresponding data node. Hence, the search client that deals with search requests will use only the coordinator node URLs.
Use Elasticsearch data nodes directly for the index path and bypass the coordinator nodes for indexing.
In this case, the receiving data node acts as the coordinating node for the indexing path, while the dedicated coordinator nodes route search requests to shard copies on the data nodes. This minimises the unnecessary load that search routing puts on data nodes.
I was wondering if there is another way to provide segregation at a higher level or I am insane to not use coordinator nodes for the indexing path as well.
P.S: My use case is heavy indexing and light/medium search
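The client-side split described above could be sketched like this. It is only an illustration of the idea, not a real Elasticsearch client: the class name, the operation labels and the host URLs are all hypothetical, and the class just picks which base URL a request would be sent to.

```python
from itertools import cycle


class SplitRouter:
    """Pick a base URL per operation type, round-robin within each pool."""

    def __init__(self, coordinator_urls, data_node_urls):
        if not coordinator_urls or not data_node_urls:
            raise ValueError("both URL pools must be non-empty")
        # Search traffic goes through coordinating-only nodes.
        self._search_pool = cycle(list(coordinator_urls))
        # Index traffic goes straight to data nodes.
        self._index_pool = cycle(list(data_node_urls))

    def url_for(self, operation):
        """Return the next base URL for 'search' or 'index'."""
        if operation == "search":
            return next(self._search_pool)
        if operation == "index":
            return next(self._index_pool)
        raise ValueError(f"unknown operation: {operation!r}")


if __name__ == "__main__":
    router = SplitRouter(
        coordinator_urls=["http://coord-1:9200"],  # hypothetical hosts
        data_node_urls=["http://data-1:9200", "http://data-2:9200"],
    )
    for op in ("index", "index", "search", "index"):
        print(op, "->", router.url_for(op))
```

Whichever node receives an index request still has to forward each document to the node holding its primary shard, so this split changes where coordination work happens rather than removing it.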
You can't fully separate indexing and search operations: indexing writes to the primary shard and then to the replica shards, whereas a search can be served by either a primary or a replica shard.
If you care about write performance:
no replica
refresh_interval > 30s, keep analyzer simple
lots of shards (spread across data nodes)
send insert/update requests to data nodes directly
try to have a hot/cold data architecture (hot/cold indices)
Coordinator nodes will not necessarily improve search performance; this depends on your workload (aggregations, etc.).
As usual, all tuning depends on your data and usage. You must find a good balance between indexing and search performance. Use the _nodes/stats endpoint to see what's going on.
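The first two write-performance tips map onto dynamic index settings that can be changed with `PUT /<index>/_settings`. A minimal sketch of the request body follows; the function name and its defaults are hypothetical, and the values are the ones suggested in the answer rather than tested recommendations. Nothing is sent to a cluster.

```python
import json

# Node-level statistics are served by the node stats API.
NODE_STATS_PATH = "/_nodes/stats"


def write_heavy_settings(refresh_interval="30s", replicas=0):
    """Build a body for PUT /<index>/_settings tuned for heavy indexing.

    refresh_interval -> how often new documents become searchable
    replicas         -> replica copies per primary shard (0 = none)
    """
    if replicas < 0:
        raise ValueError("replica count cannot be negative")
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": replicas,
        }
    }


if __name__ == "__main__":
    print(json.dumps(write_heavy_settings(), indent=2))
    print("stats endpoint:", NODE_STATS_PATH)
```

Both settings trade something away: a longer refresh interval delays when new documents show up in searches, and running with zero replicas means a node failure makes the shards on it unavailable.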

Resources