elasticsearch custom_score multiplication is inaccurate

I've inserted some documents that are all identical except for one floating-point field, called a.
When the script of a custom_score query is set to just _score, the resulting score is 0.40464813 for a particular query matching some fields. When the script is then changed to _score * a (mvel) for the same query, where a is 9.908349251612433, the final score becomes 4.0619955.
Now, if I run this calculation in Chrome's JS console, I get 4.009394996051871.
4.0619955 (Elasticsearch)
4.009394996051871 (Chrome)
This is quite a difference, and it produces an incorrect ordering of results. Why could this be, and is there a way to correct it?

If I run a simple calculation using the numbers you provided, then I get the result that you expect.
curl -XPOST 'http://127.0.0.1:9200/test/test?pretty=1' -d '
{
   "a" : 9.90834925161243
}
'

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1' -d '
{
   "query" : {
      "custom_score" : {
         "script" : "0.40464813 * doc[\u0027a\u0027].value",
         "query" : {
            "match_all" : {}
         }
      }
   }
}
'
# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "a" : 9.90834925161243
#             },
#             "_score" : 4.009395,
#             "_index" : "test",
#             "_id" : "lPesz0j6RT-Xt76aATcFOw",
#             "_type" : "test"
#          }
#       ],
#       "max_score" : 4.009395,
#       "total" : 1
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 1
# }
I think what you are running into here is testing with too little data spread across multiple shards.
Doc frequencies are calculated per shard by default. So if you have two identical docs on shard_1 and one doc on shard_2, the docs on shard_1 will score lower than the doc on shard_2.
With more data, the document frequencies tend to even out across shards. But when testing with small amounts of data, you either want to create an index with only one shard, or add search_type=dfs_query_then_fetch to the query string parameters.
This calculates global doc frequencies across all involved shards before calculating the scores.
If you set explain to true in your query, you can see exactly how your scores are being calculated.
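For example, a minimal sketch (reusing the test index from above; the custom_score script is just illustrative) that enables both options at once:

curl -XGET 'http://127.0.0.1:9200/test/test/_search?search_type=dfs_query_then_fetch&pretty=1' -d '
{
   "explain" : true,
   "query" : {
      "custom_score" : {
         "script" : "_score * doc[\u0027a\u0027].value",
         "query" : {
            "match_all" : {}
         }
      }
   }
}
'

The _explanation section of each hit then shows the term and document frequencies that went into the score, which makes shard-level differences easy to spot.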

Related

what does the total hits mean without specify track_total_hits=true?

When I send a query to Elasticsearch, I get a response with a hits->total section like below:
"hits" : {
   "total" : {
      "value" : 10000,
      "relation" : "eq"
   },
   ...
I think the value inside total indicates the count of documents matching the query. However, the value is not accurate unless I specify track_total_hits=true in the query. The question is: what does the value mean if I don't specify track_total_hits?
This value comes from a default cap of 10,000, which is also the default of index.max_result_window.
For performance reasons, Elasticsearch stops counting and reports this capped value when it finds more than that many matching documents, so without track_total_hits the total is only accurate up to 10,000.
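If you need the exact count, a minimal sketch (the index name my-index is a placeholder) is to request it explicitly:

curl -XGET 'http://127.0.0.1:9200/my-index/_search?pretty' -H 'Content-Type: application/json' -d '
{
   "track_total_hits" : true,
   "query" : { "match_all" : {} }
}
'

With track_total_hits set to true, hits.total.value is the exact number of matching documents and relation is "eq"; you can also set it to an integer to raise the counting threshold instead of removing it entirely.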

improving performance of search query using index field when working with alias

I am using an alias name when writing data with the Bulk API.
I have 2 questions:
Can I get the index name after writing data using the alias name, maybe as part of the response?
Can I improve performance by sending search queries to specific indexes instead of searching across all indexes behind the same alias?
If you're using an alias name for writes, that alias can only resolve to a single write index, and that index name is what you're going to receive back in the bulk response.
For instance, if test_alias is an alias to the test index, then when sending this bulk command:
POST test_alias/_doc/_bulk
{"index":{}}
{"foo": "bar"}
You will receive this response:
{
   "index" : {
      "_index" : "test",        <---- here is the real index name
      "_type" : "_doc",
      "_id" : "WtcviYABdf6lG9Jldg0d",
      "_version" : 1,
      "result" : "created",
      "_shards" : {
         "total" : 2,
         "successful" : 2,
         "failed" : 0
      },
      "_seq_no" : 0,
      "_primary_term" : 1,
      "status" : 201
   }
}
Common sense has it that searching on a single index is always faster than searching on an alias spanning several indexes, but if the alias only spans a single index, then there's no difference.
You can provide multiple index names when searching. If you search an alias that has multiple indices, by default the search runs against all of them, but you can also restrict it to a few of the indices behind the alias, for example based on fields in the underlying indices.
You can read the "Filter-based aliases to limit access to data" section in this blog for how to achieve that; since it queries fewer indices and less data, search performance will be better.
Also, an alias can have only a single writable index, and you can get its name from the _cat/aliases?v API response as well, which shows which index is the write index for the alias; you can see the sample output here.
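As a minimal sketch (test_alias and the concrete index names are placeholders), you can list the aliases and then target a subset of indices directly:

# list the indices behind the alias, including which one is the write index
curl -XGET 'http://127.0.0.1:9200/_cat/aliases/test_alias?v'

# search only specific indices instead of the whole alias
curl -XGET 'http://127.0.0.1:9200/test-000001,test-000002/_search?pretty' -H 'Content-Type: application/json' -d '
{
   "query" : { "match_all" : {} }
}
'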

Getting child documents

I have an Elasticsearch index. Each document in that index has a number (i.e. 1, 2, 3, etc.) and an array named ChildDocumentIds, along with some additional properties. Each item in this array is the _id of a document that is related to this document.
I have a saved search named "Child Documents". I would like to use the number (i.e. 1, 2, 3, etc.) to get the child documents associated with it.
Is there a way to do this in Elasticsearch? I can't seem to find a way to do a relational-type query in Elasticsearch for this purpose. I know it will be slow, but I'm OK with that.
The terms query allows you to do this. If document #1000 had child documents 3, 12, and 15 then the following two queries would return identical results:
"terms" : { "_id" : [3, 12, 15] }
and:
"terms" : {
"_id" : {
"index" : <parent_index>,
"type" : <parent_type>,
"id" : 1000,
"path" : "ChildDocumentIds"
}
}
The reason that it requires you to specify the index and type a second time is that the terms query supports cross-index lookups.
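Put together, a minimal sketch of the full request (assuming parents and children live in the same index, with the placeholder names parent_index and parent_type standing in for your own) would be:

curl -XGET 'http://127.0.0.1:9200/parent_index/parent_type/_search?pretty=1' -d '
{
   "query" : {
      "terms" : {
         "_id" : {
            "index" : "parent_index",
            "type" : "parent_type",
            "id" : "1000",
            "path" : "ChildDocumentIds"
         }
      }
   }
}
'

Elasticsearch first fetches document 1000, reads the ChildDocumentIds array from it, and then runs a terms query on _id with those values.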

Elasticsearch - Getting multiple documents with multiple custom offset and size 1

Currently, the way I get multiple documents for the exact same query but at different offsets, each with size 1, is the Elasticsearch Multi Search API. I wonder if there is a better way to do this that would give better performance.
An example of the query I am currently using:
{"index" : "test"}
{"query" : {"term" : { "user" : "Kimchy" }}, "from" : a, "size" : 1}
{}
{"query" : {"term" : { "user" : "Kimchy" }}, "from" : b, "size" : 1}
{}
{"query" : {"term" : { "user" : "Kimchy" }}, "from" : c, "size" : 1}
{}
{"query" : {"term" : { "user" : "Kimchy" }}, "from" : d, "size" : 1}
{}
{"query" : {"term" : { "user" : "Kimchy" }}, "from" : e, "size" : 1}
....
where a, b, c, d, e are parameters supplied at query time.
If I understand you correctly, a, b, c, d, e will all be numbers, right? So you basically want to be able to ask Elasticsearch for, say, the 3rd, 4th, and 7th documents that show up for a specific query.
I'm not sure it's the best way to do things, but it would certainly be faster to find the smallest and largest numbers among a through e, then do "from" : smallest and "size" : largest - smallest + 1. Then take the results that ES returns and go through them yourself to pick out the specific documents.
Every time you do a from/size query, Elasticsearch has to collect all the hits before that offset anyway, so you are currently basically redoing the same search over and over.
This approach does get sketchy if there is a large difference between your smallest and largest numbers, though, since you may end up pulling back thousands of documents.
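As a minimal sketch, assuming the requested offsets happen to be 3, 4 and 7 (so smallest = 3 and largest = 7), you would issue a single query and slice the hits yourself:

curl -XGET 'http://127.0.0.1:9200/test/_search?pretty=1' -d '
{
   "query" : { "term" : { "user" : "Kimchy" } },
   "from" : 3,
   "size" : 5
}
'
# hits.hits[0] is offset 3, hits.hits[1] is offset 4, hits.hits[4] is offset 7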

Aggregation distinct values in ElasticSearch

I'm trying to get the distinct values and their amount in ElasticSearch.
This can be done via:
"distinct_publisher": {
"terms": {
"field": "publisher", "size": 0
}
}
The problem I have is that it counts the terms, but if a value in the publisher field contains a space, e.g.:
"Chicken Dog"
and 5 documents have this value in the publisher field, then I get 5 for chicken and 5 for dog:
"buckets" : [
{
"key" : "chicken",
"doc_count" : 5
},
{
"key" : "dog",
"doc_count" : 5
},
...
]
But what I want to get as the result is:
"buckets" : [
   {
      "key" : "Chicken Dog",
      "doc_count" : 5
   }
]
The reason you're getting a count of 5 for each of chicken and dog is that your documents were analyzed at the time you indexed them.
This means Elasticsearch did some light processing to turn Chicken Dog into the tokens chicken and dog (lowercasing, and tokenizing on whitespace). You can see how Elasticsearch will analyze a given piece of text into searchable tokens by using the Analyze API, for example:
curl -XGET 'localhost:9200/_analyze?text=Chicken+Dog'
In order to aggregate over the "raw" distinct values, you need a not_analyzed mapping so Elasticsearch doesn't do its usual processing. This reference may help. You may need to reindex your data with the not_analyzed mapping to get the result you want.
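A minimal sketch of such a mapping (the index, type and field names are illustrative, using the pre-5.x string/not_analyzed syntax this answer refers to) adds a raw sub-field and aggregates on that instead:

curl -XPUT 'http://127.0.0.1:9200/publishers' -d '
{
   "mappings" : {
      "doc" : {
         "properties" : {
            "publisher" : {
               "type" : "string",
               "fields" : {
                  "raw" : { "type" : "string", "index" : "not_analyzed" }
               }
            }
         }
      }
   }
}
'

curl -XGET 'http://127.0.0.1:9200/publishers/_search?pretty=1' -d '
{
   "size" : 0,
   "aggs" : {
      "distinct_publisher" : {
         "terms" : { "field" : "publisher.raw", "size" : 0 }
      }
   }
}
'

Because publisher.raw is never analyzed, the bucket key comes back as the untouched value "Chicken Dog" with a doc_count of 5.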
