ElasticSearch more_like_this - Are options ran on the source or destination index? - elasticsearch

A useful feature of the more_like_this function is ES is the ability to cross-search different indices, assuming field names and mappings correspond.
One thing that has me confused is how the Term Selection Parameters are applied in these situations.
Consider:
max_doc_freq
The maximum document frequency above which the terms will be ignored from the input document. This could be useful in order to ignore highly frequent words such as stop words. Defaults to unbounded (Integer.MAX_VALUE, which is 2^31-1 or 2147483647).
Is this the document frequency on the source document index? Or will it be applied to the index we are querying?
Example:
GET index_a/_search
{
"query": {
"function_score": {
"query": {
"bool": {
"should": [
{
"more_like_this": {
"boost": 1,
"fields": [
"text"
],
"include": true,
"like": [
{
"_id": "tI2N_24BFVRF37fDxSTT",
"_index": "index_b"
}
],
"max_doc_freq": 50000,
"max_query_terms": 50,
"min_term_freq": 1,
"min_word_length": 4,
"minimum_should_match": "1%",
"stop_words": []
}
}
]
}
},
"script_score": {
"script": "1.0"
}
}
}
}
max doc freq in this case is set to 50,000. But is this on index_a? or index_b?

Thats considered in rewrite phrase of query. so index_b . Rewrite phase rewrites MLT to a bool query

Related

Is it possible to affect execution order of filters in Elasticsearch?

We have a query of the form:
{
"query": {
"bool": {
"filter": [
{
"term": {
"userId": {
"value": "a_user_id",
"boost": 1
}
}
},
{
"range": {
"date": {
"from": 1648598400000,
"to": 1648684799999,
"boost": 1
}
}
},
{
"query_string": {
"query": "*MyQuery*",
"fields": [
"aField^1.0",
"anotherField^1.0",
"thirdField^1.0"
],
"boost": 1
}
}
],
"boost": 1
}
}
}
If we remove the third filter (the query_string one), performance is dramatically improved (typically going from around 2000 to 20 ms) for different variants of the above query.
The thing is, the first two filters (on userId and the date range) will always result in only a handful of search hits (say 50 or so).
So, if it was possible to hint that to Elasticsearch, or otherwise affect the query plan, it could solve our issue.
In old (1.x) versions of ES it seems that this was affected by the order of filters. from Elasticsearch: Order of filters for best performance:
"The order of filters in a bool clause is important for performance. More-specific filters should be placed before less-specific filters in order to exclude as many documents as possible, as early as possible. If Clause A could match 10 million documents, and Clause B could match only 100 documents, then Clause B should be placed before Clause A."
But newer versions are smarter - https://www.elastic.co/blog/elasticsearch-query-execution-order:
Q: Does the order in which I put my queries/filters in the query DSL matter?
A: No, because they will be automatically reordered anyway based on their respective costs and match costs.
But is it still possible to reach the desired outcome here by modifying the ES search request somehow?
Your query should be like below, so that filters run first and will only select ~50 or so documents and then your costly query_string (because of the leading wildcard) will only run on those 50 docs.
{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "*MyQuery*",
"fields": [
"aField^1.0",
"anotherField^1.0",
"thirdField^1.0"
],
"boost": 1
}
}
],
"filter": [
{
"term": {
"userId": {
"value": "a_user_id",
"boost": 1
}
}
},
{
"range": {
"date": {
"from": 1648598400000,
"to": 1648684799999,
"boost": 1
}
}
}
],
"boost": 1
}
}
}

ElasticSearch: obtaining individual scores from each query inside of a bool query

Assume I have a compound bool query with various "must" and "should" statements that each may include different leaf queries including "multi-match" and "match_phrase" queries such as below.
How can I get the score from individual queries packed into a single query?
I know one way could be to break it down into multiple queries, execute each, and then aggregate the results in code-level (not query-level). However, I suppose that is less efficient, plus, I lose sorting/pagination/.... features from ElasticSearch.
I think "Explanation API" is also not useful for me since it provides very low-level details of scoring (inefficient and hard to parse) while I just need to know the score for each specific leaf query (which I've also already named them)
If I'm wrong on any terminology (e.g. compound, leaf), please correct me. The big picture is how to obtain individual scores from each sub-query inside of a bool query.
PS: I came across Different score functions in bool query. However, it does not return the scores. If I wrap my queries in "function_score", I want the scoring to be default but obtain the individual scores in response to the query.
Please see the snippet below:
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "...",
"fields": [
"field1^3",
"field2^5"
],
"_name": "must1_mm",
"boost": 3
}
}
],
"should": [
{
"multi_match": {
"query": "...",
"fields": [
"field3^2",
"field4^5"
],
"boost": 2,
"_name": "should1_mm",
"boost": 2
}
},
{
"match_phrase": {
"field5": {
"_name": "phrase1",
"boost": 1.5,
"query": "..."
}
}
},
{
"match_phrase": {
"field6": {
"_name": "phrase2",
"boost": 1,
"query": "..."
}
}
}
]
}
}
}```

Elasticsearch ordering by field value which is not in the filter

can somebody help me please to make a query which will order result items according some field value if this field is not part of query in request. I have a query:
{
"_source": [
"ico",
"name",
"city",
"status"
],
"sort": {
"_score": "desc",
"status": "asc"
},
"size": 20,
"query": {
"bool": {
"should": [
{
"match": {
"normalized": {
"query": "idona",
"analyzer": "standard",
"boost": 3
}
}
},
{
"term": {
"normalized2": {
"value": "idona",
"boost": 2
}
}
},
{
"match": {
"normalized": "idona"
}
}
]
}
}
}
The result is sorted according field status alphabetically ascending. Status contains few values like [active, canceled, old....] and I need something like boosting for every possible values in query. E.g. active boost 5, canceled boost 4, old boost 3 ........... Is it possible to do it? Thanks.
You would need a custom sort using script to achieve what you want.
I've just made use of generic match_all query for my query, you can probably go ahead and add your query logic there, but the solution that you are looking for is in the sort section of the below query.
Make sure that status is a keyword type
Custom Sorting Based on Values
POST <your_index_name>/_search
{
"query":{
"match_all":{
}
},
"sort":[
{ "_score": "desc" },
{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"if(params.scores.containsKey(doc['status'].value)) { return params.scores[doc['status'].value];} return 100000;",
"params":{
"scores":{
"active":5,
"old":4,
"cancelled":3
}
}
},
"order":"desc"
}
}
]
}
In the above query, go ahead and add the values in the scores section of the query. For e.g. if your value is new and you want it to be at say value 2, then your scores would be in the below:
{
"scores":{
"active":5,
"old":4,
"cancelled":3,
"new":6
}
}
So basically the documents would first get sorted by _score and then on that sorted documents, the script sort would be executed.
Note that the script sort is desc by nature as I understand that you would want to show active documents at the top, followed by other values. Feel free to play around with it.
Hope this helps!

elastic search function_score query performance

I'm doing function_score queries in elastic search.
The boost weights of the query are determined ad-hoc (and differ between users). Also, the terms that are queried will differ between users depending on context. An example query might look like this:
{
"query": {
"function_score": {
"filter": {
"term": { "in_stock": true },
... more filters ...
},
"functions": [
{
"filter": { "term": { "color": "red" }},
"weight": 2
},
{
"filter": { "term": { "style": "elegant" }},
"weight": 1
},
{
"filter": { "term": { "length": "long" }},
"weight": 3
}
],
"score_mode": "sum",
}
}
}
The document is simple and looks along the lines of:
{
"product_id" : "abc",
"name" : "blah blah",
"price" : 10
"in_stock" : true,
"color: "red",
"style" : "elegant",
"length" : "long",
... more attributes...
}
the mapping types of the filtered terms are keywords and boolean. Not doing any free text stuff anywhere.
The query performance is reasonable until the index size becomes large (around 1 million documents in the index). At that point the query will take multiple seconds to complete.
Index configuration:
I've played around with limiting shard size, currently the shards are limited to 1 million items because after that the performance seems to become even worse. Replication is at 5. The index is read only.
Since the weights and the terms will differ between queries, I'm not sure if it is possible to pre-sort the index in such a way that will speed up the query.
I'm not sure how/if elastic search can cache results, score and ordering in the case of weighted queries.

elasticsearch parent child extremely inefficient has_child query

I have a parent-child relationship in an ES index. The distribution in terms of the number of documents is around 20% for the parents (200M docs) and 80% children (1B docs). ES cluster has 5 nodes, each with 20GB RAM and 4 CPU cores. ES version is 1.5.2. We use 5 shards per index and 0 replication.
When I query it using the has_child, the processing is extremely slow - 170 sec. However, when I just run over the parents it takes less than a second.
This query takes far too long to return and causes timeouts within the application. I really care about the aggregations and time range filter.
I believe what is happening is that the query is running over every child first to do the filtering. In reality, I only would like it to run over the parents first and check if there is a single document and then use filter on the children.
Setup
The _parent is an action that looks like this
{
"a": "m_field",
"b": "b_field",
"c": "c_field",
"d": "d_field"
}
The _child is a timestamp when that action has occurred
{
"date": "2016-07-07T11:11:11Z"
}
These are typically stored in time series indices. Indexes are sharded by a month. An index usually takes around 70GB total size on disk. We choose to run it over an alias, which combines all or some of the most recent indices.
Query
When I query I do a query_string on the _parent document to search for the keyword and a Range filter on the child, using the has_child query.
This looks like the following.
{
"size": 0,
"aggs": {
"base_aggs": {
"cardinality": {
"field": "a"
}
}
},
"query": {
"bool": {
"must": [
{
"filtered": {
"query": {
"query_string": {
"query": "*",
"fields": [
"a",
"b",
"c",
"d",
"e"
],
"default_operator": "and",
"allow_leading_wildcard": true,
"lowercase_expanded_terms": true
}
},
"filter": {
"has_child": {
"type": "evt",
"min_children": 1,
"max_children": 1,
"filter": {
"range": {
"date": {
"lte": "2016-07-06T23:59:59.000",
"gte": "2016-06-07T00:00:00.000"
}
}
}
}
}
}
}
],
"must_not": [
{
"term": {
"b": {
"value": ""
}
}
},
{
"term": {
"b": {
"value": "__"
}
}
}
]
}
}
}
So the query should match on my query_string with the entry "*" and have children that are between the two dates provided. Because I only care about the aggregations I do not return any documents, and I only need to match on a single child document.
Question
How can I improve the speed of the query?
The performance of a has_child query or filter with the min_children
or max_children parameters is much the same as a has_child query with
scoring enabled.
https://www.elastic.co/guide/en/elasticsearch/guide/2.x/has-child.html#min-max-children
So I guess, you would have to drop those parameters to speed up the query.

Resources