I really need some help with using Elasticsearch through the Java API.
Let's assume I am using the Java API from ES.
So far, I understand that Elasticsearch can give inconsistent results because primary and replica shards can be out of step (deleting a doc changes the overall stats, since the doc is only marked as deleted instead of being removed right away).
So what I tried is
searchRequest.preference("_primary_first").
This gave me consistent results (since it only uses the primary shards!)
Now what I want to try in my toy example is:
1) using preference=Custom (string) value
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-preference
2) if I have 5 nodes, I want to designate which node to use based on the queryText.
For instance,
'''
if (queryRequest.text().equals("red")) {
    // use 1st node
    searchRequest.preference("??????");
} else if (queryRequest.text().equals("blue")) {
    // use 2nd node
    searchRequest.preference("??????");
} else {
    // use any of the 3rd~5th nodes <- but this is not necessary if it is really hard..
    searchRequest.preference("???????");
}
'''
Q1)
I guess I need to use the custom setting "WISELY" to denote which node to use...
Can someone give me a simple Java API example?
Q2)
This is a separate one, but is there any way to load the status of each node from the searchResponse? (again, in a Java-API-friendly way)
Q3)
Is there any clever way to say "use the 1st node" (or a certain node ID?) for a given query text? (instead of using a hash map...)
For instance,
let's say I don't know which query text I will receive, but I want to distribute the queries evenly across the nodes (5 of them!)
But I want to stick with the first choice.
If the very first query text I see is "red" and I designate this queryRequest to use Node1, then later I also want to use Node1 when I see the query text "red" again. Does anyone have an idea?
Thank you guys!
Disclaimer:
I am a non-CS guy and an independent learner who is trying new things to break out of my comfort zone! :) Please excuse this silly question!
Actually, it's not a silly question, and the answer has two parts.
You mention nodes and you want to control which node gets what queries based on an attribute.
Some context:
An Elasticsearch cluster is made up of Elasticsearch nodes.
Your documents are "saved" in an Elasticsearch index, and the queries you perform run against that index.
An Elasticsearch index is only an abstraction, a layer that hides the complexity of shards (basically Lucene indices).
When you save a document, it will eventually be stored in a shard (there are segments etc., but there is no reason to go any further here). You can have primary shards and replica shards. When you save something, it goes to a primary shard and Elasticsearch replicates it to the replica shards (if any). Your searches can and will be served by both primary and replica shards.
Now, you want to control which node gets what. What you can control is which shard gets what, via routing on save and via routing on search.
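To make "routing" a bit more concrete: the Elasticsearch documentation describes the target shard as hash(routing) % number_of_primary_shards. The following is only a minimal sketch of that idea. The class and method names are made up for illustration, and String.hashCode() is a stand-in for the hash function Elasticsearch really uses, so the shard numbers will not match a real cluster.

```java
/** Hypothetical illustration of routing; not part of the Elasticsearch API. */
final class RoutingSketch {

    /** Maps a routing value to a shard number in [0, numPrimaryShards). */
    static int shardFor(String routing, int numPrimaryShards) {
        if (numPrimaryShards <= 0) {
            throw new IllegalArgumentException("numPrimaryShards must be positive");
        }
        // Assumption: String.hashCode() stands in for Elasticsearch's real hash.
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(routing.hashCode(), numPrimaryShards);
    }

    public static void main(String[] args) {
        for (String routing : new String[] {"red", "blue", "green"}) {
            System.out.println(routing + " -> shard " + shardFor(routing, 5));
        }
    }
}
```

The same routing value always maps to the same shard number, which is what lets a search that carries that routing value go to a single shard instead of all of them.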
You've asked to control which node gets what. Most of the time you won't need this. Since what you can control is which shard gets what, you then need to control which node holds which shard. This can be accomplished via shard allocation awareness.
Both of these topics are advanced features, and you need to know what you are doing when you use them, or you'll get very unexpected results.
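For the "stick with the first choice" part of the question (Q3), one possible approach that needs no hash map is to derive the custom preference string from the query text itself. This is a sketch under assumptions: PreferenceBuckets and its methods are hypothetical names, and a custom preference value only asks Elasticsearch to reuse the same shard copies for the same string. It does not name a specific node.

```java
/** Hypothetical helper: turns a query text into a stable custom preference value. */
final class PreferenceBuckets {

    /** Returns a bucket in [0, buckets) that is always the same for the same text. */
    static int bucketFor(String queryText, int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        // String.hashCode() is specified by the JDK, so the bucket is stable across runs.
        return Math.floorMod(queryText.hashCode(), buckets);
    }

    /** Builds the custom preference string for a query text. */
    static String preferenceFor(String queryText, int buckets) {
        // Built-in preferences such as _primary_first start with an underscore,
        // so the custom value deliberately does not.
        return "bucket-" + bucketFor(queryText, buckets);
    }

    public static void main(String[] args) {
        for (String text : new String[] {"red", "blue", "red"}) {
            System.out.println(text + " -> " + preferenceFor(text, 5));
        }
    }
}
```

The result could then be passed to the call from the question, e.g. searchRequest.preference(PreferenceBuckets.preferenceFor(queryRequest.text(), 5)). The spread over the 5 buckets is only roughly even, since it depends on the hash of the texts you actually receive.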
Related
The Elasticsearch parent/child relationship requires the parent and its children to be on the same shard, which is done by using the _routing field during ingestion.
I was wondering whether using the same approach would improve performance when using the collapse feature of Elasticsearch, or whether it would make it worse.
If we look at both cases:
1) Routing to the same shard: the shard is able to do the collapsing on its own and return already fully collapsed documents.
2) Documents are on many shards: the collapse can only happen later, with all shards returning lots of documents that are collapsed afterwards.
I do not know whether Elasticsearch would still do the 2nd even when the documents are on the same shard.
Thanks.
The full genesis of field collapsing (introduced in ES 5.3) can be found in PR 22337 (issue 21833).
Initially, the idea was to create a new top_groups aggregation, modeled after a terms+top_hits combo, but in the end it was deemed too costly to implement and not necessarily optimal.
Field collapsing was finally implemented in the search layer, because it can benefit from the existing query/fetch phases and requires a lot less memory than doing it as an aggregation. Pagination also works out of the box.
It was discussed whether it would be a good idea to use the grouping field as a routing key to make sure all top hits were located on the same shard, but in the end this was deemed too big a limitation.
So, long story short, with field collapsing there is no requirement to locate all documents on the same shard, because the fetch request (phase 2) will be sent to all shards anyway.
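To illustrate why co-locating documents is not required, here is a deliberately simplified model of collapsing: each shard keeps its best hit per collapse key, and the coordinating step does the same over the merged candidates. All names are hypothetical and this is not Elasticsearch's actual code (real shards only return their top hits, for example); it is just a sketch of the idea.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified model of field collapsing across shards. */
final class CollapseSketch {

    static final class Hit {
        final String id;
        final String key;   // value of the collapse field
        final double score;

        Hit(String id, String key, double score) {
            this.id = id;
            this.key = key;
            this.score = score;
        }
    }

    /** Keeps only the best-scoring hit per collapse key, sorted by score descending. */
    static List<Hit> collapse(List<Hit> hits) {
        Map<String, Hit> best = new HashMap<>();
        for (Hit h : hits) {
            best.merge(h.key, h, (a, b) -> a.score >= b.score ? a : b);
        }
        List<Hit> out = new ArrayList<>(best.values());
        out.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return out;
    }

    /** Each shard collapses locally; the merged candidates are collapsed once more. */
    static List<Hit> search(List<List<Hit>> shards) {
        List<Hit> candidates = new ArrayList<>();
        for (List<Hit> shard : shards) {
            candidates.addAll(collapse(shard));
        }
        return collapse(candidates);
    }
}
```

In this model the final result is the same whether the hits of one key sit on a single shard or are spread over several; only the number of candidates that travel to the coordinating step changes.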
As always, the best way is to try it out for yourself and measure the performance.
1 index with 1 shard (with and without routing key)
1 index with several shards (with and without routing key)
My take is that it would make no big difference, because only the top hits are collapsed and a normal search query (without field collapsing) would go through both query/fetch phases as well anyway.
I was going through Elasticsearch and wanted to get consistent responses from ES clusters.
I read "Elasticsearch read and write consistency",
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/docs-index_.html
and some other posts, and I conclude that ES returns success for a write operation only after completing the write on all shards (primary + replicas), irrespective of the consistency param.
Let me know if my understanding is wrong.
I am also wondering how Elasticsearch adds a node/shard back into the cluster after it was down transiently. Will it start serving read requests as soon as it is available, or does it make sure it has up-to-date data before serving read requests?
I looked for the answer to above question, but could not find any.
Thanks
Gopal
If a node is removed from the cluster and then joins again, Elasticsearch checks whether its data is up to date. If it is not, the shard will not be made available for search until it has been brought up to date again (which could mean the whole shard gets copied again).
The consistency parameter is just an additional pre-index check on whether the expected number of shard copies is available in the cluster (if the index is configured with 4 replicas and consistency is set to quorum, then the primary shard plus two replicas need to be available). However, this parameter never changes the fact that a write has to be applied to all available shard copies before returning to the client.
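The quorum check can be written out as a small calculation. The sketch below follows the formula given in the Elasticsearch 2.x documentation, int((primary + number_of_replicas) / 2) + 1, including its documented exception that the quorum is only enforced when number_of_replicas is greater than 1. The class and method names are hypothetical, not part of any Elasticsearch API.

```java
/** Hypothetical calculator for the pre-index consistency check (ES 2.x semantics). */
final class WriteConsistencySketch {

    /** Number of shard copies that must be active before the write is attempted. */
    static int requiredActiveCopies(String consistency, int numberOfReplicas) {
        if (numberOfReplicas < 0) {
            throw new IllegalArgumentException("numberOfReplicas must not be negative");
        }
        int totalCopies = 1 + numberOfReplicas; // one primary plus its replicas
        switch (consistency) {
            case "one":
                return 1;
            case "all":
                return totalCopies;
            case "quorum":
                // Documented exception: quorum is only enforced with more than 1 replica.
                return numberOfReplicas > 1 ? totalCopies / 2 + 1 : 1;
            default:
                throw new IllegalArgumentException("unknown consistency: " + consistency);
        }
    }

    public static void main(String[] args) {
        System.out.println(requiredActiveCopies("quorum", 4));
    }
}
```

With 4 replicas and quorum this gives the primary plus two replicas, matching the example above. It only gates whether the write starts; it does not reduce the number of copies the write is applied to.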
I'm having an issue with scoring: when I run the same query multiple times, the documents are not scored the same way each time. I found out that the problem is well known; it's the bouncing results issue.
A bit of context: I have multiple shards across multiple nodes (60 shards, 10 data nodes), all the nodes are running ES 2.3, and we're heavily using nested documents (the example query doesn't use them, for simplicity).
I tried to resolve it by using the preference search parameter, with a custom value. The documentation states:
A custom value will be used to guarantee that the same shards will be used for the same custom value. This can help with "jumping values" when hitting different shards in different refresh states. A sample value can be something like the web session id, or the user name.
However, when I run this query multiple times:
GET myindex/_search?preference=asfd
{
  "query": {
    "term": {
      "has_account": {
        "value": "twitter"
      }
    }
  }
}
I end up having the same documents, but with different scoring/sorting. If I enable explain, I can see that those documents are coming from different shards.
If I use preference=_primary or preference=_replica, we have the expected behavior (always the same shard, always the same scoring/sorting) but I can't query only one or the other...
I also experimented with search_type=dfs_search_then_fetch, which should generate the scoring based on the whole index, across all shards, but I still get different scoring for each run of the query.
So in short, how do I ensure the score and the sorting of the results of a query stay the same during a user's session?
It looks like my replicas went out of sync with the primaries.
No idea why, but deleting the replicas and recreating them has "fixed" the problem... I'll need to investigate why they went out of sync.
Edit 21/10/2016
Regarding the "preference" option not being taken into account: it's linked to the AWS zone awareness. If the preferred replica is in a different zone than the client node, then the preference will be ignored.
The differences between the replicas are "normal" if you delete (or update) documents. From my understanding, the deleted document count will vary between the replicas, since they're not necessarily merging segments at the same time.
We are designing a large framework around Elasticsearch and are investigating a few options.
For some complex analysis jobs, we are looking for a way to retrieve data from only the currently connected Elasticsearch node, i.e. only data from the primary shard on the node that I am connected to via the client, or no result if there is no primary shard located on this node.
Is this possible via some search attribute or via a more specialized setup?
We want to use the normal Elasticsearch functionality as much as possible, naturally, but sometimes there might be queries that need this type of access. Is this doable with Elasticsearch?
You can restrict the search to specific shards using the preference query string parameter (see https://www.elastic.co/guide/en/elasticsearch/reference/1.7/search-request-preference.html).
E.g. by sending your query to http://ES-NODE:9200/INDEXNAME/_search?preference=_shards:1
you should be able to restrict the query to shard 1.
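If you build such URLs in code, a small helper along these lines could be used. This is just a sketch: ShardPreferenceUrl is a hypothetical name, ES-NODE and INDEXNAME are the placeholders from the URL above, and the comma-separated form (_shards:2,3) follows the preference documentation linked above.

```java
/** Hypothetical helper that builds a search URL restricted to given shard numbers. */
final class ShardPreferenceUrl {

    static String searchUrl(String baseUrl, String index, int... shards) {
        if (shards.length == 0) {
            throw new IllegalArgumentException("at least one shard number is required");
        }
        // Builds e.g. "_shards:1" or "_shards:2,3".
        StringBuilder preference = new StringBuilder("_shards:");
        for (int i = 0; i < shards.length; i++) {
            if (i > 0) {
                preference.append(',');
            }
            preference.append(shards[i]);
        }
        return baseUrl + "/" + index + "/_search?preference=" + preference;
    }

    public static void main(String[] args) {
        System.out.println(searchUrl("http://ES-NODE:9200", "INDEXNAME", 1));
    }
}
```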
I have a scan/scroll query where each document that comes back has something done to it, and then the changes are written back. Basically, I am mapping over the whole index (or document type, actually).
If the function applied during this mapping starts to become too slow, then I need to find a way to split the work across several machines.
I could share a scroll ID across multiple machines using Zookeeper or something similar, but will there be issues with querying ES from 2 clients at almost the same time?
Alternatively, is there a way to write a query that will only run against one specified node? That way, if I had one 'mapping process' on the same box as each node, I could remove the network overhead.
Check the "_only_node" and "_prefer_node" preference options in the Elasticsearch API.
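As far as I know, in the Elasticsearch 1.x/2.x documentation both options take a node ID after a colon (_only_node:xyz, _prefer_node:xyz); newer versions document plural variants, so check the docs for your version. A minimal sketch of building these values, with a hypothetical class name and a made-up node ID:

```java
/** Hypothetical helper for node-based preference values (ES 1.x/2.x style). */
final class NodePreference {

    /** Restricts the search to the node with the given ID. */
    static String onlyNode(String nodeId) {
        return "_only_node:" + requireId(nodeId);
    }

    /** Prefers the node with the given ID if applicable. */
    static String preferNode(String nodeId) {
        return "_prefer_node:" + requireId(nodeId);
    }

    private static String requireId(String nodeId) {
        if (nodeId == null || nodeId.isEmpty()) {
            throw new IllegalArgumentException("nodeId must not be empty");
        }
        return nodeId;
    }

    public static void main(String[] args) {
        // "my-node-id" is a placeholder; use the ID of the node local to the worker.
        System.out.println(onlyNode("my-node-id"));
    }
}
```

The resulting string goes into the preference parameter of the search (or scroll) request, so each 'mapping process' could pass the ID of the node it runs next to.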