Elasticsearch replica selection mechanism - elasticsearch

Does anyone have information about the replica selection mechanism that Elasticsearch uses, that is, the basis on which a particular replica of a shard is selected to serve a query? I have looked through the Elasticsearch documentation but was unable to find this information.
Kindly share any relevant resources.

I found the answer through a discussion on the Elasticsearch forums. Put simply, Elasticsearch uses a round-robin scheme to select which replica within a replica group (of sorts) responds to a query. As I understand it, this replica group is chosen based on the awareness and preference attributes provided in the configuration.
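As a rough illustration of the round-robin idea described above, here is a minimal Python sketch. It is not Elasticsearch's actual implementation: the node names, the `preferred` predicate (standing in for awareness/preference settings), and the fallback to all copies when the filter matches nothing are all assumptions made for the example.

```python
from itertools import cycle


def make_replica_picker(shard_copies, preferred=None):
    """Return a function that hands out shard copies in round-robin order.

    shard_copies: node names holding a copy of one shard (hypothetical names).
    preferred: optional predicate that narrows the group; an assumed stand-in
               for awareness/preference settings.
    """
    group = [c for c in shard_copies if preferred is None or preferred(c)]
    if not group:
        # Assumption: fall back to every copy when the filter matches none.
        group = list(shard_copies)
    rotation = cycle(group)
    return lambda: next(rotation)


pick = make_replica_picker(["node-a", "node-b", "node-c"])
picks = [pick() for _ in range(5)]
print(picks)

# Narrow the group first, then rotate within it.
pick_zone = make_replica_picker(
    ["node-a", "node-b", "node-c"], preferred=lambda n: n != "node-b"
)
zone_picks = [pick_zone() for _ in range(3)]
print(zone_picks)
```

The point of the sketch is only the two-step shape: first narrow the candidate copies, then rotate through whatever remains.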

Related

How does Elasticsearch bring back a node that was down

I have been working with Elasticsearch and want to get consistent responses from ES clusters.
I read "Elasticsearch read and write consistency",
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/docs-index_.html
and some other posts, and concluded that ES returns success for a write operation only after completing the write on all shard copies (primary + replicas), irrespective of the consistency param.
Let me know if my understanding is wrong.
I am also wondering how Elasticsearch adds a node/shard that was down transiently back into the cluster. Will it start serving read requests as soon as it is available, or does it ensure that it has up-to-date data before serving read requests?
I looked for an answer to this question but could not find one.
Thanks
Gopal
If a node is removed from the cluster and joins again, Elasticsearch checks whether its data is up to date. If it is not, the shard is not made available for search until it has been brought up to date again (which could mean the whole shard gets copied again).
The consistency parameter is just an additional pre-index check that the expected number of shard copies is available in the cluster (with quorum, if the index is configured to have 4 replicas, then the primary shard plus two replicas need to be available). However, this parameter never changes the behaviour that a write must be written to all available shard copies before returning to the client.
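The pre-index check can be sketched as a small calculation. This is an illustration in Python, not Elasticsearch code: the function name is hypothetical, and the rule that quorum is only enforced when an index has more than one replica is my recollection of the 1.x/2.x behaviour, so treat it as an assumption to verify against your version.

```python
def required_active_copies(number_of_replicas, consistency="quorum"):
    """Shard copies that must be active before a write is attempted.

    Hypothetical helper mirroring the one / quorum / all consistency levels.
    """
    total_copies = 1 + number_of_replicas  # the primary plus its replicas
    if consistency == "one":
        return 1
    if consistency == "all":
        return total_copies
    if consistency == "quorum":
        # Assumption: with 0 or 1 replicas the quorum rule is not enforced,
        # so the primary alone is enough.
        if number_of_replicas <= 1:
            return 1
        return total_copies // 2 + 1
    raise ValueError(f"unknown consistency level: {consistency}")


# 4 replicas -> 5 copies in total -> quorum is 3 (primary plus two replicas)
print(required_active_copies(4))
```

Note that this only decides whether the write is attempted; once it is, the write still goes to every available copy.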

Multi-tenancy in Elasticsearch

We are planning to introduce Elasticsearch (on AWS) for our multi-tenant application. We have the following options:
1) One index per tenant
2) One type per tenant
3) All tenants share one index, with custom routing
According to this blog post, https://www.elastic.co/blog/found-multi-tenancy, the first option would cause memory issues, but I am not clear about the other options.
It seems that with the third option there is no data segregation, and I am not sure about security.
I believe the second option would be better, as the data would be segregated.
Please help me identify the best option for multi-tenancy with Elasticsearch.
Please note that we would be using AWS infrastructure.
We are considering the same question right now, and the following set of articles by Elasticsearch was very helpful.
Start here: https://www.elastic.co/guide/en/elasticsearch/guide/current/scale.html
And read through each subsequent article until you hit this one: https://www.elastic.co/guide/en/elasticsearch/guide/current/finite-scale.html
The following two were very eye-opening for me:
https://www.elastic.co/guide/en/elasticsearch/guide/current/faking-it.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/one-big-user.html
The basic takeaway:
Alias per customer
Shard routing
Now you can have indexes for big customers, shared indexes for little customers, and they all appear to be separate indices
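To make the "alias per customer plus shard routing" takeaway concrete, here is a minimal Python sketch that builds the request body for the `_aliases` endpoint. The index name `shared_customers`, the field name `tenant_id`, and the `tenant_<id>` alias naming are assumptions for illustration only; the sketch just builds and prints the JSON and does not talk to a cluster.

```python
import json


def tenant_alias_action(shared_index, tenant_id, tenant_field="tenant_id"):
    """Build one 'add' action for POST /_aliases.

    The alias carries a routing value (so requests hit one shard) and a
    term filter (so the tenant only sees its own documents).
    Index, field, and alias naming here are hypothetical.
    """
    return {
        "add": {
            "index": shared_index,
            "alias": f"tenant_{tenant_id}",
            "routing": str(tenant_id),
            "filter": {"term": {tenant_field: tenant_id}},
        }
    }


body = {"actions": [tenant_alias_action("shared_customers", "acme")]}
print(json.dumps(body, indent=2))
```

A large customer can later be moved to a dedicated index by pointing the same alias name at that index, so client code that only knows the alias does not need to change.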
This link is too important not to mention here:
http://www.bigeng.io/elasticsearch-scaling-multitenant/
Good architecture dilemmas, and great performance analysis and reasoning.
tl;dr: they used index groups built around shard allocation filtering to segregate load across nodes in the cluster.
To sum up the accepted answer and the other articles:
1) Use a shared index with custom routing, accessed through an alias.
1.1) Special case: a big client can have a dedicated index, but only if needed.
The following article covers many use cases in detail:
https://www.elastic.co/blog/found-multi-tenancy
The following is the conclusion on how to do it (link source: accepted answer):
https://www.elastic.co/guide/en/elasticsearch/guide/current/faking-it.html

Is there a way to instruct Elasticsearch to only return matches from one node

We are designing a large framework around Elasticsearch and are investigating a few options.
For some complex analysis jobs, we are looking for a way to retrieve data only from the Elasticsearch node we are currently connected to, i.e. only data from the primary shards on the node the client is connected to, or no result if no primary shard is located on that node.
Is this possible via some search attribute or via a more specialized setup?
Naturally, we want to use the normal Elasticsearch functionality as much as possible, but sometimes there might be queries that need this type of access. Is this doable with Elasticsearch?
You can restrict the search to specific shards using the preference query string parameter (see https://www.elastic.co/guide/en/elasticsearch/reference/1.7/search-request-preference.html).
For example, sending your query to http://ES-NODE:9200/INDEXNAME/_search?preference=_shards:1
should restrict it to shard 1.
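A small Python sketch of building such a URL follows. The helper name `search_url` is hypothetical, and `ES-NODE` / `INDEXNAME` are the placeholders from the example above. The same reference page also lists other preference values such as `_primary` and `_only_node:<node id>`, as far as I recall; check them against the documentation for your version before relying on them.

```python
from urllib.parse import urlencode


def search_url(node, index, preference, port=9200):
    """Build a _search URL carrying a preference value.

    ':' and ',' are kept unescaped so values like '_shards:1,2' stay readable.
    """
    query = urlencode({"preference": preference}, safe=":,")
    return f"http://{node}:{port}/{index}/_search?{query}"


url = search_url("ES-NODE", "INDEXNAME", "_shards:1")
print(url)
```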

Track what documents are coming from which shards in Elasticsearch

I have enabled routing, and all my sets of documents are going to the same shard. Now I need to hit that machine directly and see whether there is a performance gain, but I haven't found a mechanism for finding out which document went to which shard. Kindly let me know if there is any way to achieve this.
You can use the Search Shards API.
Sample syntax:
GET /index/type/_search_shards?routing={routing_id}
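Once you have the response from that call, a few lines of Python can turn it into a "shard number to nodes" map. This is a sketch only: the `sample` response below is hand-written to show the general shape I would expect (a `nodes` map plus a `shards` list of copy groups), and the node IDs, node names, and index name in it are made up. Verify the field names against a real response from your cluster.

```python
def shard_locations(search_shards_response):
    """Map each shard number to a list of (node name, is_primary) pairs."""
    nodes = search_shards_response.get("nodes", {})
    locations = {}
    for copies in search_shards_response["shards"]:
        for copy in copies:
            node_id = copy["node"]
            # Fall back to the raw node ID if the name is not in the response.
            node_name = nodes.get(node_id, {}).get("name", node_id)
            locations.setdefault(copy["shard"], []).append(
                (node_name, copy["primary"])
            )
    return locations


# Hand-written illustrative response, not captured from a real cluster.
sample = {
    "nodes": {"n1": {"name": "es-data-1"}, "n2": {"name": "es-data-2"}},
    "shards": [
        [
            {"index": "logs", "shard": 3, "node": "n1", "primary": True, "state": "STARTED"},
            {"index": "logs", "shard": 3, "node": "n2", "primary": False, "state": "STARTED"},
        ]
    ],
}

locations = shard_locations(sample)
print(locations)
```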

Query and allocate data to shards based on tags

I'm running a typical logstash-redis-elasticsearch system to capture all my logs (around 500 GB/day). To my knowledge, Elasticsearch queries every shard in an index and aggregates the results, but given the daily log volume and the response times needed, I want to query only a few shards, chosen based on some "tag" in the message. So I'm looking for a way to allocate data to shards based on tags and to query only the relevant shards for a given tag. Any leads, references or solutions on how to achieve this?
I've already looked at shard allocation filtering, but it doesn't cater to this specific requirement.
Routing is the way to go here.
Specifying a routing value while indexing causes the document to be routed to a specific shard. See routing in the index API.
You can also extract the routing value from a field. See the routing field.
Don't forget to search with the same routing value. See the routing option in search.
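The reason the same routing value must be used at index time and at search time is that the target shard is derived from it, roughly as hash(routing) modulo the number of primary shards. The Python sketch below illustrates this; `zlib.crc32` is only a stand-in for Elasticsearch's own hash function, so the shard numbers it produces will not match a real cluster, and the tag names are made up.

```python
import zlib


def shard_for(routing_value, number_of_primary_shards):
    """Pick a shard from a routing value: hash(routing) % primary shard count.

    zlib.crc32 is an assumed stand-in hash, used only to show the mechanism.
    """
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % number_of_primary_shards


# Every document indexed with the tag "payments" lands on one shard, and a
# search that passes the same routing value only needs to visit that shard.
index_time = shard_for("payments", 5)
search_time = shard_for("payments", 5)
print(index_time == search_time)
```

This is also why the number of primary shards cannot change without reindexing: the modulo would send existing routing values to different shards.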
