I was going through elastic search and wanted to get consistent response from ES clusters.
I read Elasticsearch read and write consistency
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/docs-index_.html
and some other posts and can conclude that ES returns success to write operation after completing writes to all shards (Primary + replica), irrespective of consistency param.
Let me know if my understanding is wrong.
I am wondering if anyone knows, how does elastic search add a node/shard back into a cluster which was down transiently. Will it start serving read requests immediately after it is available or does it ensures it has up to date data before serving read requests?
I looked for the answer to above question, but could not find any.
Thanks
Gopal
If node is removed from the cluster and it joins again, Elasticsearch checks if the data is up to date. If it is not, then it will not be made available for search, until it is brought up to date again (which could mean the whole shard gets copied again).
the consistency parameter is just an additional pre-index check if the number of expected shards are available in the cluster (if the index is configured to have 4 replicas, then the primary shard plus two replicas need to be available, if set to quorum). However this parameter does never change the behaviour that a write needs to be written to all available shards, before returning to the client.
Related
I'm using ElasticSearch on AWS EC2.
And i want to implement today's popular keyword function in ES.
there is 3 indexes(place, genre, name), and i want see today's popular keyword in name index only.
I tried to use ES slowlog and logstash. but slowlog save logs every shard's log.
(ex)number of shards : 5 then 5 query log saved.
Is there any good and easy way to implement popular keyword in ES?
As far as I know, this is not supported by Elasticsearch and you need to build your own custom solution.
Design you mentioned using the slowlog is not good as you mentioned its on per shard basis, even if you do some more computing and able to merge and relate them to a single search at index level, it would not be good, as
you have to change the slow log configuration and for every index there needs to be a different threshold, you can change it to 0ms, to make sure you get all the search queries in slow logs, but that would take a huge disk space and would not be good for Elasticsearch performance.
You have to do some parsing of slow log in your application and if you do it runtime it would be very costly.
I think you can maintain a distributed cache in your application where you store the top searched keyword like the leaderboard of a multi-player gaming app, which is changing very frequently but in your case, you don't even have to update this cache very frequently. I would not go into much implementation details, but simple Hashmap of search term as key and count as value would solve the issue.
Hope this helps. let me know if you have questions.
I really really need some helps on elastic search usage in java api...
Let's assume I am using java api from ES.
So far, I understand that elastic search can give inconsistent result due to primary and replica's inconsistency issue(deleting doc makes stats difference in overall due to deletion marking instead of delete it).
So what I tried it
searchRequest.preference("_primary_first").
This gave me consistent result(since it only uses primary shard!)
Now what I want to try in my toy example is,
1) using preference=Custom (string) value
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-preference
2) if I have 5 nodes, I want to designate which node we want to use based on the queryText.
For instance,
'''
if (queryRequest.text().equals("red")) {
// use 1st node
searchRequest.preference("??????")
} else if (queryRequest.text().equals("blue")) {
// use 2nd node
searchRequest.preference("??????")
} else {
// use either 3rd~5th node <- but this is not necessary if it is really hard..
searchRequest.prefernce("???????")
}
'''
Q1)
I guess I need to use custom setting "WISELY" to denote which node to use...
can someone give me simple java.api example?
Q2)
This is another one, but is there any way we can load status for each node from searchResponse?(again in java api friendly)
Q3)
Is there any clever way to specify to use 1st Node(or certain Node Id??) with given query text?(instead using hashmap things...)
For instance,
let say I don't know which query text I will receive, but I want to evenly distribute them to each node(among 5!)
But want to stick with the first choice.
if I see very first query text == "red" and I designate this queryRequest to use Node1, then later I also want to use Node1 if I see the query text == "red" again.. Does someone have idea?
Thank you guys!
Disclaimer:
I am non-CS guy and independant learner who tried to experiment new things to break my comfort zones! :) Please excuse this silly question!
Actually it's not a silly question and the answer has two parts.
You mention nodes and you want to control which node gets what queries based on an attribute.
Some context:
An elasticsearch cluster has elasticsearch nodes
Your documents will be "saved" in an elasticsearch index and the queries you perform will be against that index
An elasticsearch index is but an abstraction, a layer that hides the complexity of shards (basically lucene indices).
Now when you save a document, that document will eventually be stored in a shard (there are segments etc, but no reason to go any further). Now you can have primary shards and replica shards. When you save something, that will go to a primary shard and will be replicated by elasticsearch to the replica shards (if any). Your searches can and will be served both by primary and replica shards.
Now, you want to control which node gets what. What you can control is which shard gets what via routing on save and via routing on search.
You've asked to control which node get's what. Most of the times you won't be needing this. What you can control is what shard gets what, so you'll need to control which node gets what shard. This can be accomplished via shard allocation awareness.
Both of these topics are advanced features and you'll need to make sure to know what you are doing when trying to use them or you'll get very unexpected results.
I have a question with Elasticsearch as below:
In mode replication is async, while a document is being indexed, the document will already be present on the primary shard but not yet copied to the replica shards. At this time, a GET request for this document is forwarded to a replica shard.
How does Elasticsearch handle this in the cases the document is not yet indexed on the replica shards or the document is not yet updated on the replica shards ?
A fail request will be returned in the case indexing a new document or a old document returned in the case updating a document? Or requesting node will re-forward to the primary shard to get data?
First of all, I wouldn't recommend using the async mode. It doesn't really provides any benefits that you couldn't achieve in a safer way but creates issues when in comes to reliability. Because of this, this feature was deprecated in v1.5 and completely removed in v2.0 of elasticsearch.
Saying that, if you still want to use async and care about getting the latest results, you have to use it with primary preference.
In case of update operation, you don't have to do anything. Update is always performed on the primary shard first and then the result of the operation is replicated to all replicas.
My application receives and parse thousands of small JSON snippets each about ~1Kb every hour. I want to create a backup of all incoming JSON snippets.
Is it a good idea to use Elasticsearch to backup this snippets in an index with f.ex. "number_of_replicas:" 4? Never read that anyone has used Elasticsearch for this.
Is my data safe in Elasticsearch when I use a cluster of servers and replicas or should I better use another storage for this use case?
(Writing it to the local file system isn't safe, as our hard discs crashes often. First I have thought about using HDFS, but this isn't made for small files.)
First you need to find difference between replica and backups.
replica is more than one copy of data at run time.It increases high availability and failover support,it wont support accidental delete of data.
Backup is copy of whole data at backup time.it will be used to restore when system crashed.
Elastic search for back up.. its not good idea.. Elastic search is a search engine not DB.If you have not configured ES cluster carefully,then you will end up with loss of data.
So in my opinion ,
To store json object, we got lot of dbs.. For example mongodb is a nosql db.We can easily configure it with more replicas.It means high availability of data and failover support.As you asked its also opensource and more reliable.
for more info about mongodb refer https://www.mongodb.org/
Update:
In elasticsearch if you create index with more shards it'll be distributed among nodes.If a node fails then the data will be lost.But in mongoDB more node means ,each mongodb node contains its own copy of data.If a mongodb fails then we can retrieve out data from replica mongodbs. We need to be more conscious about replica setup and shard allocation in Elasticsearch. But in mongoDB it's easier and good architecture too.
Note: I didn't say storing data in elasticsearch is not safe.I mean, comparing to mongodb,it's difficult to configure replica and maintain in elasticsearch.
Hope it helps..!
By routing we can allocate particular file/doc/json in particular shard which make it easy to extract data.
But I am thinking as would it be possible to store particular field of json file in particular shard.
for eg:
i had three field : username , message and time. I had created 3 shard for indexing.
Now i want that
username is stored in one shard , message in another shard and time in another shard.
Thanks
No this is not possible. The whole document (the JSON doc) will be stored on one shard. If you want to do what you describe, then you should split the data up into separate docs and then you can route them differently.
As for the reasoning, imagine there was a username query which matched document5. If document5 was spread over many shards, these would all have to be queried to get the other parts of document5 back to compile the results. Imagine further a complex AND query across different fields, there would be a lot of traffic (and waiting) to find out if both fields match to compute if the document was a hit or not.