Elasticsearch filtering with input array where - elasticsearch

Our requirement is to filter objects on an array field by passing an input array to Elasticsearch. An object matches when its mentions array is made up of any combination of elements from the input array.
A small example:
data:[
{"name": "xxxx", "mentions": ["X", "Y"]},
{"name": "yyyy", "mentions": ["K", "L", "M"]},
{"name": "zzz", "mentions": ["X", "L"]},
]
Input: [X, Y, K, L]
Output:[
{"name": "xxxx", "mentions": ["X", "Y"]},
{"name": "zzz", "mentions": ["X", "L"]}
]
Objects must be filtered on the mentions field: every member of the mentions array must be in the given input array. If any member is not, the object is ignored.
A terms query, or a bool query with must clauses, does not solve our problem.

A very simplistic solution is to use a regular expression in a Regexp query.
Your query would look like this:
POST <your_index_name>/_search
{
"query": {
"bool": {
"must_not": [ <---- Note this.
{
"regexp": {
"mentions": "[^XYKL]" <---- Note this.
}
}
]
}
}
}
Square brackets [...] match any one of the characters listed inside them.
What I've done is simply use the negation character ^ inside the brackets and wrap that regex in the must_not clause of a Bool query, which should give you what you are looking for.
The query returns only documents whose mentions values are all among X, Y, K and L. A document that contains any other value is not returned. Because a regexp query has to match the entire term, the pattern [^XYKL] matches any single-character value other than X, Y, K or L, so as written it covers single-character values like those in the example.
Note that I'm assuming the field mentions is of type keyword.
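To check what the query does on the sample data without a running cluster, here is a minimal Python sketch. It builds the same request body and imitates the regexp clause locally with re.fullmatch, on the assumption that the pattern is matched against each whole keyword value. The helper names build_query and simulate are made up for illustration, and the allowed values are assumed to be single plain letters, as in the question.

```python
import re

# Sample documents from the question.
docs = [
    {"name": "xxxx", "mentions": ["X", "Y"]},
    {"name": "yyyy", "mentions": ["K", "L", "M"]},
    {"name": "zzz", "mentions": ["X", "L"]},
]


def build_query(allowed):
    """Build the must_not/regexp body for single-letter allowed values."""
    # Assumption: every allowed value is one plain letter, so no escaping is needed.
    pattern = "[^" + "".join(allowed) + "]"
    return {"query": {"bool": {"must_not": [{"regexp": {"mentions": pattern}}]}}}


def simulate(query, documents):
    """Local stand-in for the search: drop a document if any value matches the pattern."""
    pattern = query["query"]["bool"]["must_not"][0]["regexp"]["mentions"]
    return [
        d for d in documents
        # fullmatch mimics a pattern that has to cover the whole term.
        if not any(re.fullmatch(pattern, value) for value in d["mentions"])
    ]


query = build_query(["X", "Y", "K", "L"])
matched = simulate(query, docs)
print([d["name"] for d in matched])  # prints ['xxxx', 'zzz']
```

The document "yyyy" is dropped because its value "M" matches the negated character class, which is the behaviour the question asks for.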

Related

Reorder object hierarchy and group by time in JSONata

Although I'm not a total JSONata noob, I'm having a hard time finding an elegant solution to the following desired transformation. The starting point is a set of time-series data in a format like this:
{
"series1": {
"data": [
{"time": "2022-01-01T00:00:00Z", "value": 22},
{"time": "2022-01-02T00:00:00Z", "value": 23}
]
},
"series2": {
"data": [
{"time": "2022-01-01T00:00:00Z","value": 220},
{"time": "2022-01-02T00:00:00Z","value": 230}
]
}
}
I need to "flip the hierarchy" and group these datapoints by timestamp into an array of objects, as follows:
[
{
"time": "2022-01-01T00:00:00Z",
"series1": 22,
"series2": 220
},
{
"time": "2022-01-02T00:00:00Z",
"series1": 23,
"series2": 230
}
]
I currently have this working with the expression
$each($, function($v, $s) {
[$v.data.{
'series': $s,
'time':$.time,
'value': $.value
}]
}).*{
`time`: {
`series`: value
}
}
~> $each(function($v, $t) {
$merge([
$v,
{'time': $t}
])
})
(playground link: https://try.jsonata.org/8CaggujJk)
...and...I can't help but feel that there must be a better way!
For reference, my current expression basically does this in three consecutive steps:
1. The first $each() function splits up the original object into an array of datapoints, each with a series name, timestamp, and value.
2. A grouping operator makes time a key and gathers all values for a given timestamp together.
3. A second $each() function transforms the object into an array of objects where time is a value again, rather than a key, and merges the time key-value pair alongside the series values.
I've seen some wonderfully elegant solutions to similar problems on here, but am not sure how to approach this in a better way. Any tips appreciated!
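For comparison only, here is what the three steps described above boil down to when written imperatively in Python rather than JSONata. It is a sketch of the intended result, not a JSONata answer, and the function name flip_by_time is invented here.

```python
# Input in the shape shown in the question.
series_map = {
    "series1": {
        "data": [
            {"time": "2022-01-01T00:00:00Z", "value": 22},
            {"time": "2022-01-02T00:00:00Z", "value": 23},
        ]
    },
    "series2": {
        "data": [
            {"time": "2022-01-01T00:00:00Z", "value": 220},
            {"time": "2022-01-02T00:00:00Z", "value": 230},
        ]
    },
}


def flip_by_time(series):
    """Group datapoints by timestamp, one output object per distinct time."""
    rows = {}
    for series_name, body in series.items():
        for point in body["data"]:
            # Create the row for this timestamp on first sight, then add the series value.
            row = rows.setdefault(point["time"], {"time": point["time"]})
            row[series_name] = point["value"]
    # Dicts keep insertion order, so rows come out in first-seen time order.
    return list(rows.values())


flipped = flip_by_time(series_map)
for row in flipped:
    print(row)
```

Keying the intermediate dictionary on the timestamp plays the role of the grouping step, and seeding each row with its own time entry replaces the final merge.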

Search across a searchable field in Elasticsearch

I'm looking for a way to search across a tokenized field in Elasticsearch so that, instead of returning the documents that match my search, it returns the unique set of field values that matched best.
{
"id": 1,
"brand": [
"word1",
"another"
]
},
{
"id": 2,
"brand": [
"word2",
"word3",
"yet_another"
]
}
So when searching for wo, I would receive a list of the words word1, word2 and word3, scored of course.
Should I create a new index for that with these values?
Is there a way I can do that work by reusing the tokenization of my index?

Combine queries and order results by score

I want Elastic to execute multiple (multi-match) queries and sort the combined results by score. The score of each query should be calculated independently of the other queries (which, I think, is different from what I have found so far with the bool/should clause).
Example:
Query 1:
"multi_match" : {
"query": "test",
"fields": ["a", "b", "c"],
"tie_breaker": 0.2,
"minimum_should_match": "50%"
}
Query 2:
"multi_match" : {
"query": "test2",
"fields": ["a", "b", "c"],
"tie_breaker": 0.2,
"minimum_should_match": "50%"
}
Combine both results and order by score. How can I do that with Elastic?

I believe the Dis Max query is what you are looking for:
A query that generates the union of documents produced by its
subqueries, and that scores each document with the maximum score for
that document as produced by any subquery, plus a tie breaking
increment for any additional matching subqueries.
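
As a rough illustration of the quoted description, the Python sketch below wraps the two multi_match clauses from the question in a dis_max request body and reproduces the scoring rule locally. The helper names multi_match, build_dis_max and dis_max_score are invented for this example, and the local scoring function is only a reading of the quoted text, not code taken from Elasticsearch.

```python
def multi_match(text):
    """One multi_match clause with the settings used in the question."""
    return {
        "multi_match": {
            "query": text,
            "fields": ["a", "b", "c"],
            "tie_breaker": 0.2,
            "minimum_should_match": "50%",
        }
    }


def build_dis_max(queries, tie_breaker=0.0):
    """Wrap sub-queries in a dis_max query body."""
    return {"query": {"dis_max": {"queries": list(queries), "tie_breaker": tie_breaker}}}


def dis_max_score(sub_scores, tie_breaker=0.0):
    """Best sub-query score plus tie_breaker times the other matching scores.

    Use None for a sub-query that did not match the document.
    """
    matching = [s for s in sub_scores if s is not None]
    if not matching:
        return None  # the document is not in the union at all
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)


body = build_dis_max([multi_match("test"), multi_match("test2")])

print(dis_max_score([3.0, None]))      # prints 3.0
print(dis_max_score([3.0, 1.0]))       # prints 3.0
print(dis_max_score([3.0, 1.0], 0.5))  # prints 3.5
```

With the outer tie_breaker left at 0, each document keeps the score of its best matching query and the other query adds nothing, which seems closest to the independent scoring the question asks for.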

elasticsearch: or operator, number of matches

Is it possible to score my searches according to the number of matches when using operator "or"?
Currently query looks like this:
"query": {
"function_score": {
"query": {
"match": {
"tags.eng": {
"query": "apples banana juice",
"operator": "or",
"fuzziness": "AUTO"
}
}
},
"script_score": {
"script": # TODO
},
"boost_mode": "replace"
}
}
I don't want to use "and" operator, since I want documents containing "apple juice" to be found, as well as documents containing only "juice", etc. However a document containing the three words should score more than documents containing two words or a single word, and so on.
I found a possible solution here https://github.com/elastic/elasticsearch/issues/13806
which uses bool queries. However I don't know how to access the tokens (in this example: apples, banana, juice) generated by the analyzer.
Any help?

Based on the discussions above I came up with the following solution, which is a bit different than I imagined when I asked the question, but works for my case.
First of all I defined a new similarity:
"settings": {
"similarity": {
"boost_similarity": {
"type": "scripted",
"script": {
"source": "return 1;"
}
}
}
...
}
Then I had the following problem:
a query for "apple banana juice" gave the same score to a doc with tags ["apple juice", "apple"] and to another doc with tags ["banana", "apple juice"], although I would like the second one to score higher.
From this other discussion I found out that the issue was caused by my having a nested field, so I created a regular text field to address it.
But I also wanted to distinguish between a doc with tags ["apple", "banana", "juice"] and another doc with tag ["apple banana juice"] (all three words in the same tag). The final solution was therefore to keep both fields (a nested and a text field) for my tags.
Finally, the query consists of a bool query with two should clauses: the first is run on the text field and uses an "or" operator; the second is run on the nested field and uses an "and" operator.
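The answer does not show the final query, so the following Python sketch is only a guess at its shape. The field names are assumptions: tags_text stands for the plain text field, and tags is taken to be the nested path with tags.eng as the field inside it. The function name build_tag_query is also invented.

```python
import json


def build_tag_query(text, text_field="tags_text", nested_path="tags", nested_field="tags.eng"):
    """Bool query with two should clauses, as described in the answer.

    All field names are hypothetical defaults, not taken from a real mapping.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    # Clause 1: plain text field, any of the words may match.
                    {"match": {text_field: {"query": text, "operator": "or"}}},
                    # Clause 2: nested field, all words must occur in the same tag.
                    {
                        "nested": {
                            "path": nested_path,
                            "query": {
                                "match": {nested_field: {"query": text, "operator": "and"}}
                            },
                        }
                    },
                ]
            }
        }
    }


tag_query = build_tag_query("apple banana juice")
print(json.dumps(tag_query, indent=2))
```

Because both clauses are should clauses, a document that matches only the first still appears in the results, while one that also has all words in a single tag gets the extra score from the second.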
Although I found a solution for this specific issue, I still face a few other problems when using ES to search for tagged documents. The examples in the documentation seem to work very well for full-text search. Does anyone know where I can find something more specific to tagged documents?

Elasticsearch: how to know which field the results are sorted by?

In Elasticsearch, is there any way to check which field the results are sorted by? I want something like inner-hits for sort clause.
Imagine that your documents have this kind of form:
{"numerals" : [ // nested
{"key": "point", "value": 30},
{"key": "points", "value": 200},
{"key": "score", "value": 20},
{"key": "scores", "value": 40}
]
}
and you sort the results by:
{"numerals.value": {
"nested_path": "numerals",
"nested_filter": {
"match": {
"numerals.key": "score"}}}}
Now I have no idea how to tell which field the results were actually sorted by: it's probably scores for this document, but perhaps score for others? There are two problems: 1. you cannot use inner-hits or highlight for nested fields, and 2. even if you could, it wouldn't solve the issue when there are multiple matching candidates.

The question is about sorting by fields that are inside nested objects.
So this is what the documentation
https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-sorting.html
and
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-sort.html#_nested_sorting_example
says:
Elasticsearch first restricts the nested documents with the "nested_filter" query and then sorts in the same way as for multi-valued fields:
exactly as if only the filtered nested documents existed as inner objects, that is, as if the root document had one multi-valued field containing exactly the values that belong to the filtered nested objects
(in your example only one value remains: 20).
If you want to be sure about the sort order, add a "mode" parameter:
"min", "max", "sum", "avg" or "median"
If you do not specify the "mode" parameter, then according to the corresponding issue the min value is picked for "asc" order and the max value for "desc" order:
By default when sorting on a multi-valued field the lowest or highest
value will be picked from the field values depending on the sort
order.
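
To make the rule concrete, here is a small Python sketch. build_nested_sort produces a sort clause in the same nested_path / nested_filter form used in the question, with an optional "mode", and sort_value imitates locally which value would be picked for one document. Both function names are invented, and the local filter compares keys by exact equality, which is a simplification of what a match query does on an analyzed field.

```python
from statistics import median


def build_nested_sort(key, order="asc", mode=None):
    """Sort clause in the form used in the question, optionally with an explicit mode."""
    clause = {
        "order": order,
        "nested_path": "numerals",
        "nested_filter": {"match": {"numerals.key": key}},
    }
    if mode is not None:
        clause["mode"] = mode
    return [{"numerals.value": clause}]


def sort_value(doc, key, order="asc", mode=None):
    """Local imitation: filter the nested objects, then reduce the remaining values."""
    # Assumption: exact key equality stands in for the match query.
    values = [n["value"] for n in doc["numerals"] if n["key"] == key]
    if not values:
        return None
    if mode is None:
        # Default described above: lowest value for asc, highest for desc.
        mode = "min" if order == "asc" else "max"
    reducers = {
        "min": min,
        "max": max,
        "sum": sum,
        "avg": lambda v: sum(v) / len(v),
        "median": median,
    }
    return reducers[mode](values)


doc = {"numerals": [
    {"key": "point", "value": 30},
    {"key": "points", "value": 200},
    {"key": "score", "value": 20},
    {"key": "scores", "value": 40},
]}
print(sort_value(doc, "score"))  # prints 20

# A hypothetical document where two nested objects pass the filter.
two_scores = {"numerals": [{"key": "score", "value": 20}, {"key": "score", "value": 40}]}
print(sort_value(two_scores, "score", order="desc"))  # prints 40
print(sort_value(two_scores, "score", mode="avg"))    # prints 30.0
```

Setting "mode" explicitly in the sort clause removes the dependence on the sort order when more than one nested object passes the filter.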
