Config number_of_shards and number_of_replicas in ELK - elasticsearch

I have been studying the ELK Stack and ran into a small problem.
I have read all the documentation I could find, and it places great emphasis on the importance of shards and replicas.
But nowhere does it say how to configure the number of each one. Some sites say it is better to leave it automatic, and others explain how to configure it in version 5.8, but that no longer works.
If someone could explain this to me, I would be very grateful.

Just a small add-on to @Val's answer, related to primary shards.
The reason you can't change the number of primary shards is that doing so would change how data is split between them and break consistent hashing, a very popular technique for horizontally scaling and partitioning data.
Replica shards are just copies, so you can increase or decrease them without any impact on consistent hashing.
If you want to change the number of primary shards, you have to create a new index and use the alias API and the Reindex API to do it efficiently.
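For illustration, a minimal sketch of that reindex-plus-alias approach (index and alias names here are made up): create a new index with the desired number of primary shards, copy the data over with the Reindex API, then switch an alias in a single request so clients keep querying the same name.

PUT new-index
{
  "settings": {
    "index.number_of_shards": 6,
    "index.number_of_replicas": 1
  }
}

POST _reindex
{
  "source": { "index": "old-index" },
  "dest": { "index": "new-index" }
}

POST _aliases
{
  "actions": [
    { "remove": { "index": "old-index", "alias": "my-data" } },
    { "add": { "index": "new-index", "alias": "my-data" } }
  ]
}

Because both alias actions run in one request, the switch is atomic from the clients' point of view.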

When you create an index, you can configure both values in the settings of that index:
PUT your-index
{
  "settings": {
    "index.number_of_shards": 3,
    "index.number_of_replicas": 1
  }
}
Also note that you can update the settings of an index after its creation, but you can only update the number of replicas and not the number of primary shards:
PUT your-index/_settings
{
  "settings": {
    "index.number_of_replicas": 2
  }
}
As simple as that!
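To double-check what an index currently uses, you can read its settings back (same placeholder index name as above):

GET your-index/_settings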

Related

Redistributing aws elasticsearch primary shard content into two equal sized shard in same existing index

I need a way to redistribute a shard's data into two equal-sized shards whenever it grows beyond a given size limit, by adding one more shard to the same index and moving half of the oversized shard's data into the newly created shard.
I can get the shard state like the following, but I need help finding a way to redistribute the data:
{
  "index": "public",
  "shard": "0",
  "store": "20GB"
}
P.S. I have tried the Split Index API (link), but it doesn't serve the purpose: it requires a new, non-existing target index and cannot work its magic on the existing index. In the example above, the index 'public' needs to stay the same, but the number of shards should increase and the data should be redistributed among them.
This is not possible; you can't change the number of primary shards of an existing Elasticsearch index, because routing and document location depend on the number of primary shards fixed at index creation time.
If you changed it, Elasticsearch would have to change the routing algorithm and redistribute the data evenly across all shards (including replicas). Doing that on a large-scale distributed stateful application is not an easy feat, and Elasticsearch does not support it as of now.
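If write downtime is acceptable, the closest workaround is to split into a brand-new index and then give it the old name via an alias. A hedged sketch, assuming a version that has the Split API (as your P.S. suggests) and using a made-up target index name:

PUT public/_settings
{
  "index.blocks.write": true
}

# the target shard count must be a multiple of the source's
POST public/_split/public-v2
{
  "settings": {
    "index.number_of_shards": 2
  }
}

# once the new index is verified, free the old name and alias it back
# so clients are unaffected
DELETE public

POST _aliases
{
  "actions": [
    { "add": { "index": "public-v2", "alias": "public" } }
  ]
}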
You cannot just add a shard without reindexing (but you can add a replica).
If part of your data is read-only and you can activate a basic licence (probably not on AWS), you can define an ILM policy.
In Open Distro, you can use the equivalent:
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/ism.html

elastic search preference setting Custom Value(Java api)

I really need some help with Elasticsearch usage through the Java API...
Let's assume I am using the Java API for ES.
So far, I understand that Elasticsearch can give inconsistent results due to inconsistencies between primary and replica shards (deleting a doc changes the overall stats differently on each copy, because documents are marked as deleted rather than removed immediately).
So what I tried is:
searchRequest.preference("_primary_first")
This gave me consistent results (since it only uses primary shards!).
Now what I want to try in my toy example is,
1) using preference=Custom (string) value
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-preference
2) if I have 5 nodes, I want to designate which node to use based on the query text.
For instance,
'''
if (queryRequest.text().equals("red")) {
    // use 1st node
    searchRequest.preference("??????");
} else if (queryRequest.text().equals("blue")) {
    // use 2nd node
    searchRequest.preference("??????");
} else {
    // use either 3rd~5th node <- but this is not necessary if it is really hard..
    searchRequest.preference("??????");
}
'''
Q1)
I guess I need to use the custom preference value wisely to denote which node to use...
Can someone give me a simple Java API example?
Q2)
This is a separate one, but is there any way to load the status of each node from the SearchResponse? (again, Java-API-friendly)
Q3)
Is there any clever way to pin a given query text to the 1st node (or a certain node id?) instead of using a hashmap?
For instance,
let's say I don't know which query texts I will receive, but I want to distribute them evenly across the 5 nodes,
while sticking with the first assignment:
if the very first query text == "red" gets assigned to Node1, then later I also want to use Node1 whenever I see query text == "red" again. Does someone have an idea?
Thank you guys!
Disclaimer:
I am a non-CS guy and an independent learner trying new things to break out of my comfort zone! :) Please excuse this silly question!
Actually it's not a silly question and the answer has two parts.
You mention nodes and you want to control which node gets what queries based on an attribute.
Some context:
An Elasticsearch cluster is made up of Elasticsearch nodes.
Your documents will be "saved" in an Elasticsearch index, and the queries you perform will be against that index.
An Elasticsearch index is but an abstraction, a layer that hides the complexity of shards (basically Lucene indices).
Now when you save a document, that document will eventually be stored in a shard (there are segments etc, but no reason to go any further). Now you can have primary shards and replica shards. When you save something, that will go to a primary shard and will be replicated by elasticsearch to the replica shards (if any). Your searches can and will be served both by primary and replica shards.
Now, you want to control which node gets what. What you can control is which shard gets what via routing on save and via routing on search.
You've asked to control which node gets what. Most of the time you won't be needing this. What you can control is which shard gets what, so you'll need to control which node gets which shard. This can be accomplished via shard allocation awareness.
Both of these topics are advanced features and you'll need to make sure to know what you are doing when trying to use them or you'll get very unexpected results.
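A minimal sketch of routing on save and on search, with made-up index, field, and routing values (paths shown in the newer single-type style; older versions use an explicit document type):

# documents indexed with the same routing value land on the same shard
PUT colors/_doc/1?routing=red
{
  "text": "red"
}

# a search that passes the same routing value only touches that shard
GET colors/_search?routing=red
{
  "query": { "match": { "text": "red" } }
}

If the goal is just consistent results per query text rather than controlling placement, passing the query text itself as the preference string (e.g. searchRequest.preference(queryRequest.text()) with the names from your snippet) gives you the stickiness without touching routing or allocation.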

ElasticSearch : More indices vs More types

We are using Elasticsearch for the following use case.
Elasticsearch Version : 5.1.1
Note: We are using AWS managed ElasticSearch
We have a multi-tenanted system in which each tenant stores data for multiple things, and the number of tenants will increase day by day.
e.g. each tenant will have the following information:
1] tickets
2] sw_inventory
3] hw_inventory
The current indexing strategy is as follows:
index name:
tenant_id (GUID), e.g. tenant_xx1234xx-5b6x-4982-889a-667a758499c8
types:
1] tickets
2] sw_inventory
3] hw_inventory
Issues we are facing:
1] Conflicts between mappings of common fields, e.g. (id, name, userId), across the types (tickets, sw_inventory, hw_inventory)
2] As the number of tenants increases, the number of indices can reach 1000 or even 2000.
Would it be a good idea to reverse the indexing strategy?
e.g.
index names:
1] tickets
2] sw_inventory
3] hw_inventory
types:
tenant_tenant_id1
tenant_tenant_id2
tenant_tenant_id3
tenant_tenant_id4
So there would be only 3 huge indices, each with N types (one per tenant).
So the question in this case is: which solution is better?
1] Many small indices and 3 types
OR
2] 3 huge indices and many types
Regards
I suggest a different approach: https://www.elastic.co/guide/en/elasticsearch/guide/master/faking-it.html
Meaning custom routing: each document carries a tenant_id or similar field (something unique to each tenant), and you use that value both for routing and for defining an alias per tenant. Then, when querying documents for a specific tenant, you go through that tenant's alias.
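A hedged sketch of that setup, with an illustrative index name and the tenant GUID from the question: one shared index, a tenant_id field on every document, and a filtered alias per tenant that also pins the routing value.

POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "tenants_v1",
        "alias": "tenant_xx1234xx-5b6x-4982-889a-667a758499c8",
        "routing": "xx1234xx-5b6x-4982-889a-667a758499c8",
        "filter": {
          "term": { "tenant_id": "xx1234xx-5b6x-4982-889a-667a758499c8" }
        }
      }
    }
  ]
}

All indexing and searching for that tenant then goes through the alias, so the routing and the filter are applied transparently.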
You use one index and one type this way. For sizing, consider the existing index size and the number of nodes, and try to come up with a number of shards such that they are split more or less evenly across all data-holding nodes and, in your tests, performance is acceptable. If, in the future, the index grows too large and the shards become too big to keep the same performance, consider creating a new index with more primary shards and reindexing everything into it. That approach is neither unheard of nor unrecommended.
1000-2000 aliases is nothing in terms of what can be handled. If you have close to 10 nodes, or more than 10, I also recommend dedicated master nodes with something like 4-6 GB of heap and at least 4 CPU cores.
Neither approach would work. As others have mentioned, both approaches cost performance and would prevent you from upgrading.
Consider having one index and type for each set of data, e.g. sw_inventory, and then having a field within the mapping that differentiates between tenants. You can then utilize document-level security in a security plugin like X-Pack or Search Guard to prevent one tenant from seeing another's records (if required).
Indices created in Elasticsearch 6.0.0 or later may only contain a single mapping type, which means that the document type (_type) is deprecated.
You can find the full explanation here, but in summary there are two solutions:
Index per document type
This approach has two benefits:
Data is more likely to be dense and so benefit from compression techniques used in Lucene.
The term statistics used for scoring in full text search are more likely to be accurate because all documents in the same index represent a single entity.
Custom type field
Of course, there is a limit to how many primary shards can exist in a cluster so you may not want to waste an entire shard for a collection of only a few thousand documents. In this case, you can implement your own custom type field which will work in a similar way to the old _type.
PUT twitter
{
  "mappings": {
    "_doc": {
      "properties": {
        "type": { "type": "keyword" },
        "name": { "type": "text" },
        "user_name": { "type": "keyword" },
        "email": { "type": "keyword" },
        "content": { "type": "text" },
        "tweeted_at": { "type": "date" }
      }
    }
  }
}
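Queries then filter on that custom field instead of _type; a small sketch against the mapping above (the query values are made up):

GET twitter/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "tweet" } }
      ],
      "must": [
        { "match": { "content": "elasticsearch" } }
      ]
    }
  }
}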
You use an older version of Elasticsearch, but the same logic applies, and it will be easier for you to move to a newer version when you decide to do so. So I think you should go with the separate-index structure, in other words 3 huge indices and many types, but with the type as a field in the mapping rather than as _type.
I think both strategies have pros and cons:
Multiple Indexes:
Pros:
- Tenant data is isolated from the others and no query would return results from more than one.
- If the total number of documents is very large, several smaller indices could give better performance
Cons: Harder to manage. If each index holds only a few documents, you may be wasting a lot of resources.
EDITED: Avoid multiple types in the same index, as per the comments on performance and the deprecation of the feature.

Duplicate documents in Elasticsearch index with the same _uid

We've discovered some duplicate documents in one of our Elasticsearch indices and we haven't been able to work out the cause. There are two copies of each of the affected documents, and they have exactly the same _id, _type and _uid fields.
A GET request to /index-name/document-type/document-id just returns one copy, but searching for the document with a query like this returns two results, which is quite surprising:
POST /index-name/document-type/_search
{
  "filter": {
    "term": {
      "_id": "document-id"
    }
  }
}
Aggregating on the _uid field also identifies the duplicate documents:
POST /index-name/_search
{
  "size": 0,
  "aggs": {
    "duplicates": {
      "terms": {
        "field": "_uid",
        "min_doc_count": 2
      }
    }
  }
}
The duplicates are all on different shards. For example, a document might have one copy on primary shard 0 and one copy on primary shard 1. We've verified this by running the aggregate query above on each shard in turn using the preference parameter: it does not find any duplicates within a single shard.
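(For anyone wanting to reproduce that check, the per-shard variant of the aggregation was along these lines, using the _shards preference and varying the shard number from 0 to 2:)

POST /index-name/_search?preference=_shards:0
{
  "size": 0,
  "aggs": {
    "duplicates": {
      "terms": {
        "field": "_uid",
        "min_doc_count": 2
      }
    }
  }
}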
Our best guess is that something has gone wrong with the routing, but we don't understand how the copies could have been routed to different shards. According to the routing documentation, the default routing is based on the document ID, and should consistently route a document to the same shard.
We are not using custom routing parameters that would override the default routing. We've double-checked this by making sure that the duplicate documents don't have a _routing field.
We also don't define any parent/child relationships which would also affect routing. (See this question in the Elasticsearch forum, for example, which has the same symptoms as our problem. We don't think the cause is the same because we're not setting any document parents).
We fixed the immediate problem by reindexing into a new index, which squashed the duplicate documents. We still have the old index around for debugging.
We haven't found a way of replicating the problem. The new index is indexing documents correctly, and we've tried rerunning an overnight processing job which also updates documents but it hasn't created any more duplicates.
The cluster has 3 nodes, 3 primary shards and 1 replica (i.e. 3 replica shards). minimum_master_nodes is set to 2, which should prevent the split-brain issue. We're running Elasticsearch 2.4 (which we know is old - we're planning to upgrade soon).
Does anyone know what might cause these duplicates? Do you have any suggestions for ways to debug it?
We found the answer! The problem was that the index had unexpectedly switched the hashing algorithm it used for routing, and this caused some updated documents to be stored on different shards to their original versions.
A GET request to /index-name/_settings revealed this:
"version": {
"created": "1070599",
"upgraded": "2040699"
},
"legacy": {
"routing": {
"use_type": "false",
"hash": {
"type": "org.elasticsearch.cluster.routing.DjbHashFunction"
}
}
}
"1070599" refers to Elasticsearch 1.7, and "2040699" is ES 2.4.
It looks like the index tried to upgrade itself from 1.7 to 2.4, despite the fact that it was already running 2.4. This is the issue described here: https://github.com/elastic/elasticsearch/issues/18459#issuecomment-220313383
We think this is what happened to trigger the change:
Back when we upgraded the index from ES 1.7 to 2.4, we decided not to upgrade Elasticsearch in-place, since that would cause downtime. Instead, we created a separate ES 2.4 cluster.
We loaded data into the new cluster using a tool that copied over all the index settings as well as the data, including the version setting which you should not set in ES 2.4.
While dealing with a recent issue, we happened to close and reopen the index. This normally preserves all the data, but because of the incorrect version setting, it caused Elasticsearch to think an upgrade was in progress.
ES automatically set the legacy.routing.hash.type setting because of the false upgrade. This meant that any data indexed after this point used the old DjbHashFunction instead of the default Murmur3HashFunction which had been used to route the data originally.
This means that reindexing the data into a new index was the right thing to do to fix the issue. The new index has the correct version setting and no legacy hash function settings:
"version": {
"created": "2040699"
}
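(For reference, a reindex along these lines, with placeholder index names, is enough to rebuild the routing with the correct hash function; the Reindex API is available from ES 2.3 onward:)

POST _reindex
{
  "source": { "index": "index-name" },
  "dest": { "index": "index-name-v2" }
}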

Elastic search preference set to custom value, document still returned from different shards

I'm having an issue with scoring: when I run the same query multiple times, the documents are not scored the same way each time. I found out that the problem is well known; it's the bouncing results issue.
A bit of context: I have multiple shards across multiple nodes (60 shards, 10 data nodes), all the nodes are running ES 2.3, and we're heavily using nested documents - the example query doesn't use them, for simplicity.
I tried to resolve it by using the preference search parameter, with a custom value. The documentation states:
A custom value will be used to guarantee that the same shards will be used for the same custom value. This can help with "jumping values" when hitting different shards in different refresh states. A sample value can be something like the web session id, or the user name.
However, when I run this query multiple times:
GET myindex/_search?preference=asfd
{
  "query": {
    "term": {
      "has_account": {
        "value": "twitter"
      }
    }
  }
}
I end up having the same documents, but with different scoring/sorting. If I enable explain, I can see that those documents are coming from different shards.
If I use preference=_primary or preference=_replica, we have the expected behavior (always the same shard, always the same scoring/sorting) but I can't query only one or the other...
I also experimented with search_type=dfs_search_then_fetch, which should generate the scoring based on the whole index, across all shards, but I still get different scoring for each run of the query.
So in short, how do I ensure the score and the sorting of the results of a query stay the same during a user's session?
Looks like my replicas went out of sync with the primaries.
No idea why, but deleting the replicas and recreating them has "fixed" the problem... I'll need to investigate why they went out of sync.
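Dropping and recreating the replicas can be done purely through the index settings, roughly like this (set the count back to whatever you normally run with):

PUT myindex/_settings
{ "index.number_of_replicas": 0 }

PUT myindex/_settings
{ "index.number_of_replicas": 1 }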
Edit 21/10/2016
Regarding the "preference" option not being taken into account, it's linked to the AWS zone awareness: if the preferred replica is in another zone than the client node, then the preference will be ignored.
The differences between the replicas are "normal" if you delete (or update) documents, from my understanding the deleted document count will vary between the replicas, since they're not necessarily merging segments at the same time.
