How to find records matching the result of a previous search using ElasticSearch Painless scripting - elasticsearch

I have the index I attached below.
Each doc in the index holds the name and height of Alice or Bob and the age at which the height was measured. Measurements taken at the age of 10 are flagged as "baseline_height_at_age_10": true
I need to do the following:
Find the height of Alice and Bob at age 10.
Return, for Alice and Bob, the records where the height is lower than their height at age 10.
So my question is: can Painless do this type of search?
I'd appreciate it if you could point me to a good example of that.
Also: is Elasticsearch Painless even a good approach for this problem? Can you suggest a better one?
The Index Mappings
PUT /test/
{
"mappings": {
"_doc": {
"properties": {
"first_name": {
"type": "keyword",
"fields": {
"raw": {
"type": "text"
}
}
},
"surname": {
"type": "keyword",
"fields": {
"raw": {
"type": "text"
}
}
},
"baseline_height_at_age_10": {
"type": "boolean"
},
"age": {
"type": "integer"
},
"height": {
"type": "integer"
}
}
}
}
}
The Index Data
POST /test/_doc/alice_green_8_110
{
"first_name": "Alice",
"surname": "Green",
"age": 8,
"height": 110,
"baseline_height_at_age_10": false
}
POST /test/_doc/alice_green_10_120
{
"first_name": "Alice",
"surname": "Green",
"age": 10,
"height": 120,
"baseline_height_at_age_10": true
}
POST /test/_doc/alice_green_13_140
{
"first_name": "Alice",
"surname": "Green",
"age": 13,
"height": 140,
"baseline_height_at_age_10": false
}
POST /test/_doc/alice_green_23_170
{
"first_name": "Alice",
"surname": "Green",
"age": 23,
"height": 170,
"baseline_height_at_age_10": false
}
POST /test/_doc/bob_green_8_120
{
"first_name": "Bob",
"surname": "Green",
"age": 8,
"height": 120,
"baseline_height_at_age_10": false
}
POST /test/_doc/bob_green_10_130
{
"first_name": "Bob",
"surname": "Green",
"age": 10,
"height": 130,
"baseline_height_at_age_10": true
}
POST /test/_doc/bob_green_15_160
{
"first_name": "Bob",
"surname": "Green",
"age": 15,
"height": 160,
"baseline_height_at_age_10": false
}
POST /test/_doc/bob_green_21_180
{
"first_name": "Bob",
"surname": "Green",
"age": 21,
"height": 180,
"baseline_height_at_age_10": false
}

You should be able to do this using aggregations alone. Assuming people only ever get taller and the measurements are accurate, you can restrict the query to documents with age 10 or under, find the max height per person, and then filter those results to exclude the baseline record:
POST test/_search
{
"size": 0,
"query": {
"range": {
"age": {
"lte": 10
}
}
},
"aggs": {
"names": {
"terms": {
"field": "first_name",
"size": 10
},
"aggs": {
"max_height": {
"max": {
"field": "height"
}
},
"non-baseline": {
"filter": {
"match": {
"baseline_height_at_age_10": false
}
},
"aggs": {
"top_hits": {
"top_hits": {
"size": 10
}
}
}
}
}
}
}
}

I've posted the same question, with emphasis on Painless scripting, on the Elasticsearch support forum (How to find records matching the result of a previous search using ElasticSearch Painless scripting), and the answer was:
"I don't think the Painless approach will work here. You cannot use
the results of one query to execute a second query with Painless.
The two-step approach that you outline at the end of your post is the
way to go."
The bottom line is that you cannot use the result of one query as an input to another query. You can filter, aggregate and more, but not that.
So the approach, as I understand it, is to do a first search, process the data externally, and then run a second search. This translates to:
Search the record where first_name=Alice and baseline_height_at_age_10=True.
Process externally, to extract the value of height for Alice at age 10.
Search for Alice's records where her height is lower than the value calculated externally.
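The three steps above can be sketched client-side in plain Python. This is a minimal illustration only: the `docs` list stands in for the hits each search would return, whereas a real setup would run two separate queries through an Elasticsearch client.

```python
# Sketch of the two-step approach, with plain Python standing in for the
# client calls. `docs` plays the role of the indexed records (a subset of
# the sample data from the question).

docs = [
    {"first_name": "Alice", "age": 8,  "height": 110, "baseline_height_at_age_10": False},
    {"first_name": "Alice", "age": 10, "height": 120, "baseline_height_at_age_10": True},
    {"first_name": "Alice", "age": 13, "height": 140, "baseline_height_at_age_10": False},
]

# Step 1: search the record where first_name=Alice and the baseline flag is true.
baseline = next(d["height"] for d in docs
                if d["first_name"] == "Alice" and d["baseline_height_at_age_10"])

# Step 2: the "process externally" part is just reading the height off that hit.
# Step 3: second search, filtering on height < baseline.
below = [d for d in docs if d["first_name"] == "Alice" and d["height"] < baseline]

print(baseline)                    # 120
print([d["age"] for d in below])   # [8]
```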

Related

k-NN multiple field search in OpenSearch

Assume that we have this index in OpenSearch:
{
"settings": {
"index.knn": True,
"number_of_replicas": 0,
"number_of_shards": 1,
},
"mappings": {
"properties": {
"title": {"type": "text"},
"tag": {"type": "text"},
"e1": {
"type": "knn_vector",
"dimension": 512,
"method": {
"name": "hnsw",
"space_type": "cosinesimil",
"engine": "nmslib",
"parameters": {"ef_construction": 512, "m": 24},
},
},
"e2": {
"type": "knn_vector",
"dimension": 512,
"method": {
"name": "hnsw",
"space_type": "cosinesimil",
"engine": "nmslib",
"parameters": {"ef_construction": 512, "m": 24},
},
},
"e3": {
"type": "knn_vector",
"dimension": 512,
"method": {
"name": "hnsw",
"space_type": "cosinesimil",
"engine": "nmslib",
"parameters": {"ef_construction": 512, "m": 24},
},
},
}
},
}
And we want to perform a search over all the fields (approximate k-NN for the vector fields). What would be the correct way to do this in OpenSearch?
I have this query that works, but I'm not sure whether it is the correct way of doing this and whether it uses approximate k-NN:
{
"size": 10,
"query": {
"bool": {
"should": [
{
"function_score": {
"query": {
"knn": {
"e1": {
"vector": [0, 1, 2, 3],
"k": 10,
},
}
},
"weight": 1,
}
},
{
"function_score": {
"query": {
"knn": {
"e2": {
"vector": [0, 1, 2, 3],
"k": 10,
},
}
},
"weight": 1,
}
},
{
"function_score": {
"query": {
"knn": {
"e3": {
"vector": [0, 1, 2, 3],
"k": 10,
},
}
},
"weight": 1,
}
},
{
"function_score": {
"query": {
"match": {"title": "title"}
},
"weight": 0.1,
}
},
{
"function_score": {
"query": {"match": {"tag": "tag"}},
"weight": 0.1,
}
},
]
}
},
"_source": False,
}
In other words, I want to know how this Elasticsearch feature can be done in OpenSearch.
Edit 1:
I want to replicate this new Elasticsearch feature in OpenSearch. The question is how, and also what exactly the query mentioned above does.
First of all, searching multiple k-NN fields in Elasticsearch is not yet supported.
You can find the development, not yet released, in issue #91187 and PR #92118, which was merged for version 8.7 (the current version is 8.6).
Looking at the OpenSearch documentation for k-NN, it does not appear to be supported there either.
However, regarding the query you provided:
The knn search clause is not well formed. The right way is, for example:
{
"query": {
"knn": {
"my_vector": {
"vector": [2, 3, 5, 6],
"k": 2
}
}
}
}
where my_vector is the name of your vector field and vector is the query vector (i.e. the query text encoded into a vector), which must have the same number of dimensions as the vector field you are searching against.
The match query value is not well formed either; the value should be the actual text you are searching for (see the documentation).
The use of function_score is unclear and not quite correct.
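Putting those corrections together, a sketch of the same hybrid query without function_score might look like the following. This is an assumption-laden illustration, not a verified OpenSearch query: the field names come from the question's mapping, "led tv" is a placeholder for the actual query text, and the per-clause weights from the original query are dropped.

```json
{
  "size": 10,
  "query": {
    "bool": {
      "should": [
        { "knn": { "e1": { "vector": [0, 1, 2, 3], "k": 10 } } },
        { "knn": { "e2": { "vector": [0, 1, 2, 3], "k": 10 } } },
        { "knn": { "e3": { "vector": [0, 1, 2, 3], "k": 10 } } },
        { "match": { "title": "led tv" } },
        { "match": { "tag": "led tv" } }
      ]
    }
  }
}
```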
Finally, if you are interested in vector search with OpenSearch, we recently wrote a blog post in which we provide a detailed description of the new neural search plugin introduced with version 2.4.0 through an end-to-end testing experience.

Elasticsearch Term suggester is not returning correct suggestions when one character is missing (instead of misspelling)

I'm using the Elasticsearch term suggester for spell correction. My index contains a huge list of ads; each ad has subject and body fields. I've found a problematic example for which the suggester does not return the correct suggestion.
I have lots of ads whose subject contains the word "soffa" and also 5 ads whose subject contains the word "sofa". Ideally, when I send "sofa" (the wrong spelling) as text to the suggester, it should return "soffa" (the correct spelling) as a suggestion, since "soffa" is the correct spelling, most ads contain "soffa", and only a few contain the misspelled "sofa".
Here is my suggester query body:
{
"suggest": {
"text": "sofa",
"subjectSuggester": {
"term": {
"field": "subject",
"suggest_mode": "popular",
"min_word_length": 1
}
}
}
}
When I send the above query, I get the response below:
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"suggest": {
"subjectSuggester": [
{
"text": "sof",
"offset": 0,
"length": 4,
"options": [
{
"text": "soff",
"score": 0.6666666,
"freq": 298
},
{
"text": "sol",
"score": 0.6666666,
"freq": 101
},
{
"text": "saf",
"score": 0.6666666,
"freq": 6
}
]
}
]
}
}
As you can see in the above response, it returned "soff" but not "soffa", although I have lots of docs whose subject contains "soffa".
I even played with parameters like suggest_mode and string_distance, but still no luck.
I also tried the phrase suggester instead of the term suggester, but the result was the same. Here is my phrase suggester query:
{
"suggest": {
"text": "sofa",
"subjectSuggester": {
"phrase": {
"field": "subject",
"size": 10,
"gram_size": 3,
"direct_generator": [
{
"field": "subject.trigram",
"suggest_mode": "always",
"min_word_length":1
}
]
}
}
}
}
I suspect it doesn't work when one character is missing, as opposed to being misspelled: in the "soffa" example, one "f" is missing.
It works fine for misspellings, e.g. when I send "vovlo" it gives me "volvo".
Any help would be hugely appreciated.
Try changing the "string_distance".
{
"suggest": {
"text": "sof",
"subjectSuggester": {
"term": {
"field": "title",
"min_word_length":2,
"string_distance":"ngram"
}
}
}
}
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html#term-suggester
I've found a workaround myself.
I added a shingle filter and an analyzer with max_shingle_size 3 (i.e. trigrams), added a subfield that uses that analyzer, and ran the suggester query against that subfield instead of the original field, and it worked.
Here are the mapping changes:
{
"settings": {
"analysis": {
"filter": {
"shingle": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 3
}
},
"analyzer": {
"trigram": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"shingle"
],
"char_filter": [
"diacritical_marks_filter"
]
}
}
}
},
"mappings": {
"properties": {
"subject": {
"type": "text",
"fields": {
"trigram": {
"type": "text",
"analyzer": "trigram"
}
}
}
}
}
}
And here is my corrected query:
{
"suggest": {
"text": "sofa",
"subjectSuggester": {
"term": {
"field": "subject.trigram",
"suggest_mode": "popular",
"min_word_length": 1,
"string_distance": "ngram"
}
}
}
}
Note that I'm pointing the suggester at subject.trigram instead of subject itself.
Here is the result:
{
"suggest": {
"subjectSuggester": [
{
"text": "sofa",
"offset": 0,
"length": 4,
"options": [
{
"text": "soffa",
"score": 0.8,
"freq": 282
},
{
"text": "soffan",
"score": 0.6666666,
"freq": 5
},
{
"text": "som",
"score": 0.625,
"freq": 102
},
{
"text": "sol",
"score": 0.625,
"freq": 82
},
{
"text": "sony",
"score": 0.625,
"freq": 50
}
]
}
]
}
}
As you can see above, "soffa" appears as the first suggestion.
There is something weird in your result for the term suggester for the word "sofa"; take a look at the text that is being corrected:
"suggest": {
"subjectSuggester": [
{
"text": "sof",
"offset": 0,
"length": 4,
"options": [
{
"text": "soff",
"score": 0.6666666,
"freq": 298
},
{
"text": "sol",
"score": 0.6666666,
"freq": 101
},
{
"text": "saf",
"score": 0.6666666,
"freq": 6
}
]
}
]
}
As you can see, it's "sof" and not "sofa", which means the correction is not for "sofa" but for "sof". So I suspect this issue is related to the analyzer you are using on this field, especially looking at the results: "soff" instead of "soffa" means it's removing the last "a".

How to sort data in elastic search based on the filter data

I am relatively new to Elasticsearch. I have data stored in Elasticsearch as shown below:
[{
"name": "user1",
"city": [{
"name": "city1",
"count": 18
},{
"name": "city2",
"count": 15
},{
"name": "city3",
"count": 10
},{
"name": "city4",
"count": 5
}]
},{
"name": "user2",
"city": [{
"name": "city2",
"count": 2
},{
"name": "city5",
"count": 5
},{
"name": "city6",
"count": 8
},{
"name": "city8",
"count": 15
}]
},{
"name": "user3",
"city": [{
"name": "city1",
"count": 2
},{
"name": "city5",
"count": 5
},{
"name": "city7",
"count": 28
},{
"name": "city2",
"count": 1
}]
}]
So what I am trying to do is find the users who have "city2" in their city list and order them by the "count" of "city2".
Here is the query I have tried:
{
"sort": [{
"city.count": {
"order" : "desc"
}
}],
"query": {
"bool": {
"must": [
{"match": {"city.name": "city2"}}
]
}
}
}
I am not able to figure out how to do the sort part. The sorting is currently considering the "count" values of all the cities, but I want the ordering to be based only on the "count" of "city2".
Any kind of help would be appreciated. Thanks in advance.
Since the field city is an object and not a nested object, what you are trying to achieve won't work. When you define a field as object, Elasticsearch flattens each of the object's field values into an array. So,
"city": [
{
"name": "city1",
"count": 18
},
{
"name": "city2",
"count": 15
},
{
"name": "city3",
"count": 10
},
{
"name": "city4",
"count": 5
}
]
is indexed as:
"city.name" : ["city1", "city2", "city3", "city4"]
"city.count": [18, 15, 10, 5]
As you can see, because of the way Elasticsearch indexes objects, the relation between each city and its count is lost.
So whenever you need to preserve that relation, you should define the field as a nested type:
{
"city": {
"type": "nested",
"properties": {
"name": {
"type": "text"
},
"count": {
"type": "long"
}
}
}
}
Sorting can then be achieved using this nested field:
{
"sort": [
{
"city.count": {
"order": "desc",
"mode": "avg",
"nested": {
"path": "city",
"filter": {
"match": {
"city.name": "city2"
}
}
}
}
}
],
"query": {
"bool": {
"must": [
{
"match": {
"city.name": "city2"
}
}
]
}
}
}
Reaching your goal will be a little complex.
First, your query says that you want the docs that contain "city2". Since at least one of the elements in the "city" array matches, the whole document is returned.
The problem is that you only want the count for city2, not for all of the cities. This is the complex part.
There are several paths you can follow:
Change your index design: instead of having an array of users, have one document per user with all their info, including the cities they have visited. The "I only want one element from the array" problem will still be there, but you will only have to fight one array at a time instead of n.
Use Painless to bring back only the count of that particular city, but it implies a lot of scripting. Don't trust the name: Painless is very painful.
Bring back all the elements and do the filtering in your code. For example, with the Python Elasticsearch client you can execute the query, return all the objects, and select only the wanted elements in Python.
Don't consider the terms aggregation: it would bring back the total count of each city without the relationship to each user, which is not what you want.
Hope this is helpful; sorry there is no straightforward solution :(
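The third option (client-side filtering) can be sketched in plain Python. The `hits` list below stands in for the documents the query would return; in a real setup it would come from the Python Elasticsearch client's response.

```python
# Client-side sketch of option 3: run the query, then pick out each user's
# "city2" entry in Python and sort on its count. `hits` stands in for the
# query response (data taken from the question).

hits = [
    {"name": "user1", "city": [{"name": "city1", "count": 18}, {"name": "city2", "count": 15}]},
    {"name": "user2", "city": [{"name": "city2", "count": 2}, {"name": "city8", "count": 15}]},
    {"name": "user3", "city": [{"name": "city7", "count": 28}, {"name": "city2", "count": 1}]},
]

def city2_count(doc):
    # Extract the count of "city2" from the user's city array.
    return next(c["count"] for c in doc["city"] if c["name"] == "city2")

ordered = sorted(hits, key=city2_count, reverse=True)
print([d["name"] for d in ordered])  # ['user1', 'user2', 'user3']
```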

Extract record from multiple arrays based on a filter

I have documents in ElasticSearch with the following structure:
"_source": {
"last_updated": "2017-10-25T18:33:51.434706",
"country": "Italia",
"price": [
"€ 139",
"€ 125",
"€ 120",
"€ 108"
],
"max_occupancy": [
2,
2,
1,
1
],
"type": [
"Type 1",
"Type 1 - (Tag)",
"Type 2",
"Type 2 (Tag)"
],
"availability": [
10,
10,
10,
10
],
"size": [
"26 m²",
"35 m²",
"47 m²",
"31 m²"
]
}
}
Basically, the detail records are split across five arrays (price, max_occupancy, type, availability, size), and the fields of the same record sit at the same index position in each array. I want to extract the record whose max_occupancy is greater than or equal to 2 (if there is no record with 2, take a 3; if there is no 3, take a 4, and so on) with the lowest price, and place the result into a new JSON object like the following:
{
"last_updated": "2017-10-25T18:33:51.434706",
"country": "Italia",
"price: ": "€ 125",
"max_occupancy": "2",
"type": "Type 1 - (Tag)",
"availability": 10,
"size": "35 m²"
}
Basically, the result should contain the extracted record (in this case the second index of each array) plus the general fields ("last_updated", "country").
Is it possible to extract such a result from Elasticsearch? What kind of query do I need to perform?
Could someone suggest the best approach?
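For reference, the selection rule described above can be sketched client-side in plain Python (data taken from the example document; the parsing of the price strings is an assumption for illustration):

```python
# Client-side sketch of the selection rule over the parallel arrays:
# among entries with max_occupancy >= 2, take the smallest qualifying
# occupancy, then the lowest price. Data from the example document.

price = ["€ 139", "€ 125", "€ 120", "€ 108"]
max_occupancy = [2, 2, 1, 1]
types = ["Type 1", "Type 1 - (Tag)", "Type 2", "Type 2 (Tag)"]
availability = [10, 10, 10, 10]
sizes = ["26 m²", "35 m²", "47 m²", "31 m²"]

def price_value(p):
    # Parse "€ 125" into 125 for numeric comparison.
    return int(p.split()[1])

# Indices whose occupancy meets the threshold, keeping only the smallest
# qualifying occupancy (2 here; otherwise 3, then 4, and so on).
candidates = [i for i, occ in enumerate(max_occupancy) if occ >= 2]
best_occ = min(max_occupancy[i] for i in candidates)
best = min((i for i in candidates if max_occupancy[i] == best_occ),
           key=lambda i: price_value(price[i]))

record = {
    "price": price[best],
    "max_occupancy": max_occupancy[best],
    "type": types[best],
    "availability": availability[best],
    "size": sizes[best],
}
print(record["price"], record["type"])  # € 125 Type 1 - (Tag)
```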
My best approach: go nested with the nested datatype.
Besides easier querying, it is easier to read and understand the connections between the objects that are currently scattered across different arrays.
Yes, if you decide on this approach you will have to edit your mapping and re-index your entire data.
What would the mapping look like? Something like this:
{
"mappings": {
"properties": {
"last_updated": {
"type": "date"
},
"country": {
"type": "string"
},
"records": {
"type": "nested",
"properties": {
"price": {
"type": "string"
},
"max_occupancy": {
"type": "long"
},
"type": {
"type": "string"
},
"availability": {
"type": "long"
},
"size": {
"type": "string"
}
}
}
}
}
}
EDIT: New document structure (containing nested documents) -
{
"last_updated": "2017-10-25T18:33:51.434706",
"country": "Italia",
"records": [
{
"price": "€ 139",
"max_occupancy": 2,
"type": "Type 1",
"availability": 10,
"size": "26 m²"
},
{
"price": "€ 125",
"max_occupancy": 2,
"type": "Type 1 - (Tag)",
"availability": 10,
"size": "35 m²"
},
{
"price": "€ 120",
"max_occupancy": 1,
"type": "Type 2",
"availability": 10,
"size": "47 m²"
},
{
"price": "€ 108",
"max_occupancy": 1,
"type": "Type 2 (Tag)",
"availability": 10,
"size": "31 m²"
}
]
}
Now it is much easier to query for any specific condition with a nested query and inner hits. For example:
{
"_source": [
"last_updated",
"country"
],
"query": {
"bool": {
"must": [
{
"term": {
"country": "Italia"
}
},
{
"nested": {
"path": "records",
"query": {
"bool": {
"must": [
{
"range": {
"records.max_occupancy": {
"gte": 2
}
}
}
]
}
},
"inner_hits": {
"sort": {
"records.price": "asc"
},
"size": 1
}
}
}
]
}
}
}
The conditions are: country is Italia AND max_occupancy >= 2.
Inner hits: sort by price in ascending order and take the first result.
Hope you'll find it useful.

elastic search suggester filter

I am implementing a suggester filter for the search operation using the Elasticsearch API.
I have encountered a problem: I can only search by prefix, not by a word in the middle.
I tried the example below:
PUT /bls
{
"mappings": {
"bl": {
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
},
"name_suggest": {
"type": "completion",
"context": {
"store": {
"type": "category"
},
"status": {
"type": "category"
}
}
}
}
}
}
}
and
POST /bls/bl/1
{
"name": "LG 32LN5110 32 inches LED TV",
"name_suggest": {
"input": ["sony 32LN5110 32 inches LED TV"],
"context": {
"store": [
44,
45
],
"status": "Active"
}
}
}
POST /bls/_suggest?pretty
{
"name_suggest": {
"text": "sony",
"completion": {
"field": "name_suggest",
"context": {
"store": "44",
"status": "Active"
}
}
}
}
I got results with the above query, but I get nothing with the query below:
POST /bls/_suggest?pretty
{
"name_suggest": {
"text": "LED",
"completion": {
"field": "name_suggest",
"context": {
"store": "44",
"status": "Active"
}
}
}
}
The above query displays the following (empty) result:
{
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"name_suggest": [{
"text": "LED",
"offset": 0,
"length": 3,
"options": []
}]
}
String fields are indexed by default, so even without specifying it they are indexed with the default analyzer if no specific analyzer is given.
For your case, you must specify
index: analyzed for the name_suggest property,
so that an analyzer containing a whitespace tokenizer is used, which tokenizes all the words in your input field. You can then search anywhere across the text.
