Different results for same query in Elasticsearch cluster

I have created an Elasticsearch cluster with 3 nodes, 3 shards and 2 replicas.
The same query fetches different results when run against the same index with the same data.
Right now the results are sorted by the _score field in descending order (I think that is the default sort), and the requirement is also that results be sorted in descending order of their score.
So my question is: why does the same query yield different results, and how can this be corrected so that the same query returns the same results every time?
The query is attached below:
{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "must": {
        "bool": {
          "must": {
            "terms": {
              "context": [
                "my name"
              ]
            }
          },
          "should": {
            "multi_match": {
              "query": "test",
              "fields": [
                "field1^2",
                "field2^2",
                "field3^3"
              ]
            }
          },
          "minimum_should_match": "1"
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "terms": {
                "audiencecomb": [
                  "1235"
                ]
              }
            },
            {
              "terms": {
                "consumablestatus": [
                  "1"
                ]
              }
            }
          ],
          "minimum_should_match": "1"
        }
      }
    }
  }
}

One of the possible reasons could be distributed IDF: by default, Elasticsearch uses a local IDF on each shard to save some performance, which can lead to different IDFs across the cluster. So you should try ?search_type=dfs_query_then_fetch, which explicitly asks Elasticsearch to compute the global IDF.
However, for performance reasons, Elasticsearch doesn't calculate the IDF across all documents in the index. Instead, each shard calculates a local IDF for the documents contained in that shard.
Because our documents are well distributed, the IDF for both shards will be the same. Now imagine instead that five of the foo documents are on shard 1, and the sixth document is on shard 2. In this scenario, the term foo is very common on one shard (and so of little importance), but rare on the other shard (and so much more important). These differences in IDF can produce incorrect results.
In practice, this is not a problem. The differences between local and global IDF diminish the more documents that you add to the index. With real-world volumes of data, the local IDFs soon even out. The problem is not that relevance is broken but that there is too little data.
For testing purposes, there are two ways we can work around this issue. The first is to create an index with one primary shard, as we did in the section introducing the match query. If you have only one shard, then the local IDF is the global IDF.
The second workaround is to add ?search_type=dfs_query_then_fetch to your search requests. The dfs stands for Distributed Frequency Search, and it tells Elasticsearch to first retrieve the local IDF from each shard in order to calculate the global IDF across the whole index.
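As a minimal sketch (my-index is a placeholder index name), the parameter is simply appended to the search URL and the request body stays unchanged; a trimmed-down version of the question's query is used here for brevity:
GET /my-index/_search?search_type=dfs_query_then_fetch
{
  "from": 0,
  "size": 10,
  "query": {
    "multi_match": {
      "query": "test",
      "fields": [
        "field1^2",
        "field2^2",
        "field3^3"
      ]
    }
  }
}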
For more information, take a look here.

Related

ES: How do quasi-join queries using global aggregation compare to parent-child / nested queries?

At my work, I came across the following pattern for doing quasi-joins in Elasticsearch. I wonder whether this is a good idea, performance-wise.
The pattern:
Connects docs in one index in a one-to-many relationship.
Somewhat like ES parent-child, but implemented without it.
Child docs need to be indexed with a field called e.g. "my_parent_id", with the value being the parent ID (see the indexing sketch after this list).
Can be used when querying for a parent, knowing its ID in advance, to also get the children in the same query.
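For illustration only (the document IDs and field values below are made up, and the exact URL path depends on your Elasticsearch version), indexing a parent and one of its children under this pattern might look like:
PUT /my-index/_doc/123
{
  "id": 123
}
PUT /my-index/_doc/456
{
  "my_parent_id": 123,
  "my_child_field1_to_return": "some value",
  "my_child_field2_to_return": "another value"
}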
The query with the quasi-join (assume 123 is the parent ID):
GET /my-index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "id": {
              "value": 123
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "my-global-agg": {
      "global": {},
      "aggs": {
        "my-filtering-all-but-children": {
          "filter": {
            "term": {
              "my_parent_id": 123
            }
          },
          "aggs": {
            "my-returning-children": {
              "top_hits": {
                "_source": {
                  "includes": [
                    "my_child_field1_to_return",
                    "my_child_field2_to_return"
                  ]
                },
                "size": 1000
              }
            }
          }
        }
      }
    }
  }
}
This query returns:
the parent (as the search query result), and
its children (as the aggregation result).
Performance-wise, is the above:
definitively a good idea,
definitively a bad idea,
or hard to tell / it depends?
It depends ;-) The idea is good; however, by default the maximum number of hits you can return in a top_hits aggregation is 100. If you try 1000, you'll get an error like this:
Top hits result window is too large, the top hits aggregator [hits]'s from + size must be less than or equal to: [100] but was [1000]. This limit can be set by changing the [index.max_inner_result_window] index level setting.
As the error states, you can increase this limit by changing the index.max_inner_result_window index setting. But if there's a default, there's usually a good reason; I would take that as a hint that it might not be that great an idea to increase it too much.
So if your parent documents have fewer than 100 children, why not; otherwise I'd seriously consider another approach.
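If you do decide to raise the limit, index.max_inner_result_window is a dynamic index setting, so it can be changed on a live index along these lines (my-index is a placeholder name):
PUT /my-index/_settings
{
  "index.max_inner_result_window": 1000
}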

How to get only x results from Elasticsearch and then stop searching?

My whole index is about 700M docs. This query:
{
  "query": {
    "term": {
      "SOME_FIELD": "SOME_TERM"
    }
  },
  "size": 10
}
applies to about 5M docs. "SOME_FIELD" is indexed, not analysed.
The query takes about 1 s on an average Hetzner server. Too slow :) I don't care about pagination, sorting or scoring; I just want the first 10 "random" matching docs.
Is there a way to do it with scoring disabled, the "MySQL way"?
filter or constant_score do not help
If you go with filters, that will remove the score computation and should provide faster query speeds:
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "SOME_FIELD": "SOME_TERM"
        }
      }
    }
  },
  "size": 10
}
If that's still too slow, you could consider using document routing, but it may not be a viable option for you, as you might have just one shard or very few distinct terms for SOME_FIELD.
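To illustrate document routing (a sketch only: my-index is a placeholder, the _doc path segment depends on your Elasticsearch version, and using SOME_TERM as the routing key is an arbitrary choice), you index and search with the same routing value so the query only has to visit the one shard that value maps to:
PUT /my-index/_doc/1?routing=SOME_TERM
{
  "SOME_FIELD": "SOME_TERM"
}
GET /my-index/_search?routing=SOME_TERM
{
  "query": {
    "term": {
      "SOME_FIELD": "SOME_TERM"
    }
  },
  "size": 10
}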
I also suggest you go over Elastic's production deployment documentation; it gives you an overview of how to configure your cluster optimally and can also produce a serious performance boost if you currently have a misconfigured cluster, e.g. running on a strong machine but keeping the default ES_HEAP_SIZE value.
The option I was looking for is "terminate_after". Unfortunately it is not "very well" documented, see:
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-limit-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-count.html#_request_parameters
So my query looks like this:
{
  "query": {
    "term": {
      "SOME_FIELD": "SOME_TERM"
    }
  },
  "size": 10,
  "terminate_after": 10
}
Don't use "10" instead of 10; Elasticsearch does not cast it to an integer and will ignore the parameter.
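As a side note, you can check in the response whether early termination actually kicked in: when terminate_after takes effect, the response carries a terminated_early flag (the values below are illustrative):
{
  "took": 3,
  "timed_out": false,
  "terminated_early": true
}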

Return list of affected indices in Elasticsearch

I need to write a query which will search across all indices in Elasticsearch and return a list of all indices where at least one document meets the query requirements.
For now I'm getting the top 2000 documents and de-duplicating them by index name.
To search across all indices in Elasticsearch, you can use the _all option.
You can try something similar to the following to get the indices which get hits for the query:
POST _all/_search
{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "your search criteria"
        }
      }
    }
  }
}
Most APIs that refer to an index parameter support execution across multiple indices, using simple test1,test2,test3 notation (or _all for all indices)
You can extract the index name from the result set; it will be present under _index.
Sample result:
"hits": [
  {
    "_index": "index-name"
  }
]
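If you only need the list of matching indices rather than the documents themselves, an alternative sketch is to let a terms aggregation collect the distinct index names (this assumes your Elasticsearch version supports aggregating on the _index metadata field; the filtered wrapper is dropped here and the size of 100 is an arbitrary cap on the number of returned index names):
POST _all/_search
{
  "size": 0,
  "query": {
    "query_string": {
      "query": "your search criteria"
    }
  },
  "aggs": {
    "matching_indices": {
      "terms": {
        "field": "_index",
        "size": 100
      }
    }
  }
}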

Elasticsearch and aggregation of subqueries

I know that Elasticsearch allows sub-aggregations (i.e. nested aggregations); however, I would like to apply an aggregation to the result of a "first" aggregation (or, in general, of any query, aggregation or not).
Concrete example: I log events about user actions (for simplicity I have documents with user_id and action). I can make a query that counts the number of actions executed by each user. However, I would like to find out the percentage (or count) of "active users" (e.g. users that have executed more than 10 actions). The ideal result would be a histogram over all users showing how active they are.
Is there a way to create such a query? Or is there any other approach I can take, other than storing the aggregated results of the subquery and computing the histogram from that?
Note: I have seen the Elastic Search and "sub queries" question, but it was about something else, it is over a year and a half old, and Elasticsearch is being actively developed.
Additionally, it seems that version 1.4 will add a scripted metric aggregation, but that would require storing a counter for every user until the reduce phase. An "approximate solution" is good enough for me, similar to what ES uses internally for its aggregations.
Here is the query I have used; notice the "min_doc_count" in the aggregation.
{
  "query": {
    "filtered": {
      "filter": {
        "and": [
          { "term": { "name": "did x" } },
          { "range": { "created_at": { "gte": "now-7d", "lte": "now" } } }
        ]
      }
    }
  },
  "aggregations": {
    "my_agg": {
      "terms": {
        "field": "user_id",
        "min_doc_count": 10,
        "size": 0
      }
    }
  }
}
This query returns the list of buckets (users) with more than 9 events in the specified time period. Just count the returned buckets to get the number of active users.
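For reference, the relevant part of the response looks roughly like this (the user IDs and counts are illustrative); the number of active users is simply the length of the buckets array:
"aggregations": {
  "my_agg": {
    "buckets": [
      { "key": "user_42", "doc_count": 17 },
      { "key": "user_97", "doc_count": 12 }
    ]
  }
}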
I have tested this approach with thousands of events and it works well. At a certain scale you will have to use Hadoop.

Elasticsearch: if I add more server nodes, can I get better query performance?

I am new to Elasticsearch. I am testing a single node on my Windows 7 machine; I have indexed 2 million documents, but the (match) query time is increasing: about 3 seconds (uncached) and 1.5 seconds (cached).
I would like to stay below 1 second per query if I go into production, so my question is:
If I add more servers (nodes), can I get better query performance, assuming the hardware is good for each server and the ES configuration is optimized? For example, if my data grows and I add n servers (nodes), does this mean I get a lower query time (below 1 second)? Is this what "scaling" means for Elasticsearch?
Here is my query (the unfiltered one); I need scores too:
json = '{
  "from" : 0, "size" : 10,
  "query" : {
    "bool" : {
      "should": [
        { "match": { "answer_1_words": "dooms best aynol steven" }},
        { "match": { "answer_2_words": "mokrane obione kenobi zembla" }},
        { "match": { "answer_3_words": "Benghazi fake yahai tperdina" }},
        { "match": { "answer_4_words": "jackson thisisit bonzai peterpan" }},
        { "match": { "answer_5_words": "Zohra Drif mami jenaipas" }},
        { "match": { "answer_6_words": "Bon wa3lah hagda hamoud" }},
        { "match": { "answer_7_words": "cola coca petrole seule" }},
        { "match": { "answer_8_words": "dieu help salut bentley" }},
        { "match": { "answer_9_words": "edite piaf chanson merci" }},
        { "match": { "answer_10_words": "gooloom seigneur anneaux espace" }}
      ]
    }
  }
}'
Shards spread your data out across multiple servers, allowing queries to be done in parallel, but you can't change the number of shards on an index after it's created. If your indexes are time-based, that's not such a bad thing, since you'll be creating new indexes all the time.
When a query comes into ES, it'll split it into as many pieces as you have shards and then do a map/reduce operation on the results. If you have multiple queries going on simultaneously, they'll be divided up among the replicas/primaries.
So if your use case is one power user making single queries at a time, you want to add shards/machines and re-index your data. If your use case is lots of users hitting the cluster at the same time, you want to add more machines and replicas (i.e. this is scaling wide due to load).
You want to keep your single shard size down in the 2-4 GB range, so if you are not using time-series data, you'll want to allocate enough shards/machines to deal with your future data growth.
The more replicas you add, though, the slower your initial indexing is going to be, so there is that trade-off.
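A minimal sketch of both knobs (my-index is a placeholder name and the numbers are arbitrary): the shard count is fixed when the index is created, while the replica count can be raised later as nodes are added:
PUT /my-index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
PUT /my-index/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}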
