Let's say my business need is to sort results differently, based on some "external" parameter that I'm passing to the query.
Documents are more or less like:
{
"transfer_rate": 2000.00,
"some_collection": [
{ "transfer_rate": 1000.00, "identifier": 1, "campaign": 1 },
{ "transfer_rate": 500.00, "identifier": 2, "campaign": 2 },
{ "transfer_rate": 750.00, "identifier": 3, "campaign": 3 },
//...
]
},
{
"transfer_rate": 500.00,
"some_collection": [
{ "transfer_rate": 1000.00, "identifier": 4, "campaign": 1 },
{ "transfer_rate": 2000.00, "identifier": 5, "campaign": 2 },
{ "transfer_rate": 625.00, "identifier": 6, "campaign": 3 },
{ "transfer_rate": 225.00, "identifier": 7, "campaign": 1 },
//...
]
}
Now let's say I have my "parameter", equal to 750.00.
I would like to order this set of documents differently, depending on how the root's transfer_rate compares to the given parameter, as follows:
If doc['transfer_rate'] >= _param, then sort by doc['transfer_rate']; else sort by the MIN of doc['some_collection'].transfer_rate.
I know some optimisations could be made to the document model, but I didn't invent it, nor am I allowed to change it or re-index.
The tricky part about the nested objects is that they contain a property (in the given example it's campaign) that has to match a criterion, so basically:
When doc['transfer_rate'] is LT _param, order by the minimum value of doc['some_collection'].transfer_rate where campaign equals XYZ.
So, for the given example and parameter, documents like the first one should be ordered by doc['transfer_rate'], and documents like the second one should be ordered by the nested minimum.
Thanks for any advice / links / support.
This is going to be a pain if you cannot reindex the data.
I came up with this query:
GET /71095886/_search
{
"query": {
"nested": {
"path": "some_collection",
"query": {
"match": {
"some_collection.campaign": 1
}
}
}
},
"sort": {
"_script": {
"type": "number",
"script": {
"lang": "painless",
"source": """
if (doc['transfer_rate'].value >= params.factor){
return doc['transfer_rate'].value;
} else {
def min = 10000;
for (item in doc['some_collection']){
if (item['transfer_rate'] < min){
min = item['transfer_rate'];
}
}
return min;
}
""",
"params": {
"factor": 2000
}
},
"order": "asc"
}
}
}
But it won't work because of the nested objects and how they are stored in Elasticsearch (actually Lucene, but let's not go down that road... yet).
If you add "nested_path": "some_collection" in the _script sort, you won't have access to the root-level transfer_rate anymore (because it is stored in separate Lucene documents).
One thing you could look into is runtime fields.
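For instance, a search-time runtime field could read the whole document from _source and emit a single sort key, which sidesteps the nested/root split entirely. This is a rough, untested sketch (the effective_rate name is made up, the factor/campaign params just mirror the example values, and reading _source in a script is slow on large result sets):
GET /71095886/_search
{
  "runtime_mappings": {
    "effective_rate": {
      "type": "double",
      "script": {
        "source": """
          double rate = ((Number) params._source['transfer_rate']).doubleValue();
          if (rate >= params.factor) {
            emit(rate);
          } else {
            // take the minimum nested transfer_rate for the requested campaign
            double min = Double.MAX_VALUE;
            for (def item : params._source['some_collection']) {
              if (item['campaign'] == params.campaign) {
                double r = ((Number) item['transfer_rate']).doubleValue();
                if (r < min) { min = r; }
              }
            }
            emit(min);
          }
        """,
        "params": { "factor": 750.0, "campaign": 1 }
      }
    }
  },
  "sort": [
    { "effective_rate": "asc" }
  ]
}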
How would the following query look:
Scenario:
I have two bases (base1 and base2), each with one column. I would like to see the difference between them, that is, what exists in base1 that does not exist in base2, taking the fictitious name of the column to be hostname.
Example:
Is the selected value of Base1.Hostname present in Base2.Hostname?
YES → DO NOT RETURN
NO → RETURN
In Python I do this with the following function:
def diff(first, second):
    second = set(second)
    return [item for item in first if item not in second]
Example of an exact match query:
GET /base1/_search
{
"query": {
"multi_match": {
"query": "webserver",
"fields": [
"hostname"
],
"type": "phrase"
}
}
}
I would like to migrate this architecture to Elasticsearch so that, in the future, I can generate forecasts from how frequently these searches change across the bases.
This could be done with an aggregation:
Collect all the hostnames from the base1 & base2 indices
For each hostname, count the occurrences in base2
Keep only the buckets where the base2 count is 0
GET base*/_search
{
"size": 0,
"aggs": {
"all": {
"composite": {
"size": 10,
"sources": [
{
"host": {
"terms": {
"field": "hostname"
}
}
}
]
},
"aggs": {
"base2": {
"filter": {
"match": {
"_index": "base2"
}
}
},
"index_count_bucket_filter": {
"bucket_selector": {
"buckets_path": {
"base2_count": "base2._count"
},
"script": "params.base2_count == 0"
}
}
}
}
}
}
By the way, don't forget to use pagination (the composite after key) to get the rest of the results.
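For example, if the first response ends with an after_key like { "host": "webserver-010" } (hostname made up here), the next page is requested by feeding that key back into the composite source:
GET base*/_search
{
  "size": 0,
  "aggs": {
    "all": {
      "composite": {
        "size": 10,
        "sources": [
          { "host": { "terms": { "field": "hostname" } } }
        ],
        "after": { "host": "webserver-010" }
      },
      "aggs": {
        "base2": {
          "filter": { "match": { "_index": "base2" } }
        },
        "index_count_bucket_filter": {
          "bucket_selector": {
            "buckets_path": { "base2_count": "base2._count" },
            "script": "params.base2_count == 0"
          }
        }
      }
    }
  }
}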
References :
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
https://discuss.elastic.co/t/data-set-difference-between-fields-on-different-indexes/160015/4
I have a field named tags in my document in elasticsearch with the following structure.
tags = [
{
"id": 10,
"related": [9, 8, 7]
}
]
I now run a filter with a list. e.g. [10, 9]. I want to filter only those documents which contain all the items in the list either in id or in related. If I search with [9, 8], the above document should be returned. If I search with [9, 12], the above document shouldn't be returned as 12 isn't present in either id or related.
I tried the terms filter, but it simply does an OR. Is there any technique that can be used to achieve the above goal?
Further, I would like to give a higher ranking to documents which contain the given items in id compared to those which contain the given items in related.
Problem Analysis
Let's break your problem into the following subproblems:
(P1) Check whether all the terms provided in the array are present in either tags.id or tags.related. This can be further decomposed into:
(P1.1) Check whether all the terms provided in the array are present in a field
(P1.2) Check whether all the terms provided in the array are spread across different fields
(P2) Assign a higher score to those documents having any of the provided terms as tags.id
Solution
To solve (P1.1), you can use the terms_set query, available in Elasticsearch v6.6 (see documentation).
To solve (P1.2), I'd copy all the values of tags.id and tags.related into a new custom field, named, e.g., tags.all. This can be achieved using the copy_to property as follows:
{
"mappings": {
"_doc": {
"properties": {
"tags": {
"properties": {
"id": {
"type": "long",
"copy_to": "tags.all"
},
"related": {
"type": "long",
"copy_to": "tags.all"
}
}
}
}
}
}
}
Then, to solve (P1), you can run your terms_set query against tags.all. E.g.,
{
"query": {
"terms_set": {
"tags.all": {
"terms": [ 9, 8 ],
"minimum_should_match_script": {
"source": "2"
}
}
}
}
}
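Side note: the minimum_should_match_script has access to params.num_terms (the number of terms supplied in the query), so the hardcoded "2" can be made dynamic:
{
  "query": {
    "terms_set": {
      "tags.all": {
        "terms": [ 9, 8 ],
        "minimum_should_match_script": {
          "source": "params.num_terms"
        }
      }
    }
  }
}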
Finally, to solve (P2), you can create a bool query with should clauses containing (i) the terms_set query described above, and (ii) a terms query against tags.id only, with a higher boost factor. I.e.,
{
"query": {
"bool": {
"should": [
{
"terms_set": {
"tags.all": {
"terms": [ 9, 8 ],
"minimum_should_match_script": {
"source": "2"
}
}
}
},
{
"terms": {
"tags.id": {
"value": [ 9, 8 ],
"boost": 2
}
}
}
]
}
}
}
For reference, I'm using Elasticsearch 6.4.0
I have an Elasticsearch query that returns a certain number of hits, and I'm trying to remove hits whose text field values are too similar. My query is:
{
"size": 10,
"collapse": {
"field": "author_id"
},
"query": {
"function_score": {
"boost_mode": "replace",
"score_mode": "avg",
"functions": [
{
//my custom query function
}
],
"query": {
"bool": {
"must_not": [
{
"term": {
"author_id": MY_ID
}
}
]
}
}
}
},
"aggs": {
"book_name_sample": {
"sampler": {
"shard_size": 10
},
"aggs": {
"frequent_words": {
"significant_text": {
"field": "book_name",
"filter_duplicate_text": true
}
}
}
}
}
}
This query uses a custom function score combined with a filter to return books a person might like (that they haven't authored). The thing is, for some people it returns books with names that are very similar (e.g. The Life of George Washington, Good Times with George Washington, Who Was George Washington), and I'd like the hits to have a more diverse set of names.
I'm using a sampler with a significant_text aggregation to group the hits by shared words, and the query gives me something like:
...,
"aggregations": {
"book_name_sample": {
"doc_count": 10,
"frequent_words": {
"doc_count": 10,
"bg_count": 482626,
"buckets": [
{
"key": "George",
"doc_count": 3,
"score": 17.278715785140975,
"bg_count": 9718
},
{
"key": "Washington",
"doc_count": 3,
"score": 15.312204414323656,
"bg_count": 10919
}
]
}
}
}
Is it possible to filter the returned documents based on this aggregation result within Elasticsearch? I.e. remove hits whose book_name_sample doc_count is less than X? I know I can do this in PHP or whatever language consumes the hits, but I'd like to keep it within ES. I've tried using a bucket_selector aggregator like so:
"book_name_bucket_filter": {
"bucket_selector": {
"buckets_path": {
"freqWords": "frequent_words"
},
"script": "params.freqWords < 3"
}
}
But then I get an error: org.elasticsearch.search.aggregations.bucket.sampler.InternalSampler cannot be cast to org.elasticsearch.search.aggregations.InternalMultiBucketAggregation
Also, if that filter removes enough documents that the hit count drops below the requested size, is it possible to tell ES to fetch the next top-scoring hits so that the hit count is filled out?
Why not use a top_hits aggregation inside the bucketing aggregation to get the relevant documents that match each bucket? You can specify how many relevant top hits you want inside the top_hits aggregation, so this basically gives you a certain number of documents for each bucket.
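As a rough illustration of the pattern only, here is a top_hits sub-aggregation under a plain terms aggregation (the books index and the book_name.keyword / author_id field names are placeholders, not taken from your mapping):
GET /books/_search
{
  "size": 0,
  "aggs": {
    "by_name": {
      "terms": {
        "field": "book_name.keyword",
        "size": 10
      },
      "aggs": {
        "top_matches": {
          "top_hits": {
            "size": 1,
            "_source": [ "book_name", "author_id" ]
          }
        }
      }
    }
  }
}
Each bucket then carries its own small set of hits, so you work with per-bucket documents instead of the flat hit list.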
I recently started working on Elasticsearch, and I am trying to search with the following criteria:
I want to apply an exact match on ENAME and a distinct on both EID and ENAME for the above data.
Let's say that, for matching, I have the string ABC.
So the result should be as below:
[
{"EID" :111, "ENAME" : "ABC"},
{"EID" : 444, "ENAME" : "ABC"}
]
You can achieve this via a combination of term query and terms aggregation.
Assuming that you have the following mapping:
PUT my_index
{
"mappings": {
"doc": {
"properties": {
"EID": {
"type": "keyword"
},
"ENAME": {
"type": "keyword"
}
}
}
}
}
And inserted the documents like this:
POST my_index/doc/3
{
"EID": "111",
"ENAME": "ABC"
}
POST my_index/doc/4
{
"EID": "222",
"ENAME": "XYZ"
}
POST my_index/doc/12
{
"EID": "444",
"ENAME": "ABC"
}
The query that will do the job might look like this:
POST my_index/doc/_search
{
"query": {
"term": { 1️⃣
"ENAME": "ABC"
}
},
"size": 0, 3️⃣
"aggregations": {
"by EID": {
"terms": { 2️⃣
"field": "EID"
}
}
}
}
Let me explain how it works:
1️⃣ - term query asks Elasticsearch to filter on exact value of a keyword field "ENAME";
2️⃣ - terms aggregation collects the list of all possible values of another keyword field "EID" and gives back the first N most frequent ones;
3️⃣ - "size": 0 tells Elasticsearch not to return any search hits (we are only interested in the aggregations).
The output of the query will look like this:
{
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"by EID": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "111", <== Here is the first "distinct" value that we wanted
"doc_count": 3
},
{
"key": "444", <== Here is another "distinct" value
"doc_count": 2
}
]
}
}
}
The output does not look exactly like what you posted in the question, but I believe it is the closest to what you can achieve with Elasticsearch.
However, this output is equivalent:
"ENAME" is implicitly present (since its value was used for filtering)
"EID" is present under the "buckets" of the aggregations section.
Note that under "doc_count" you will find the number of documents having such "EID".
What if I want to do a DISTINCT on several fields?
For a more complex scenario (e.g. when you need to do a distinct on many fields) see this answer.
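One common approach (not necessarily the one in the linked answer) is the composite aggregation, available since Elasticsearch 6.1, which buckets on several fields at once. A sketch against the mapping above:
POST my_index/doc/_search
{
  "size": 0,
  "query": {
    "term": {
      "ENAME": "ABC"
    }
  },
  "aggregations": {
    "by EID and ENAME": {
      "composite": {
        "sources": [
          { "EID": { "terms": { "field": "EID" } } },
          { "ENAME": { "terms": { "field": "ENAME" } } }
        ]
      }
    }
  }
}
Each returned bucket key is then a distinct (EID, ENAME) pair.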
More information about aggregations is available here.
Hope that helps!
I have documents that have a list of labels:
{
"fields": {
"label": [
"foo",
"bar",
"baz"
],
"name": [
"Document One"
],
"description" : "A fine first document",
"id" : 1
}
},
{
"fields": {
"label": [
"foo",
"dog"
],
"name": [
"Document Two"
],
"description" : "A fine second document",
"id" : 2
}
}
I have a list of terms:
[ "foo", "bar", "qux", "zip", "baz"]
I want a query that will return documents that have labels in the list of terms - but no other terms.
So given the list above, the query would return Document One, but not Document Two (because it has the term dog, which is not in the list of terms).
I've tried doing a query using a not terms filter, like this:
POST /documents/_search?size=1000
{
"fields": [
"id",
"name",
"label"
],
"filter": {
"not": {
"filter" : {
"bool" : {
"must_not": {
"terms": {
"label": [
"foo",
"bar",
"qux",
"zip",
"baz"
]
}
}
}
}
}
}
}
But that didn't work.
How can I create a query that, given a list of terms, will match documents that only contain terms in the list, and no other terms? In other words, all documents should contain a list of labels that are a subset of the list of supplied terms.
I followed Rohit's suggestion, and implemented an Elasticsearch script filter. You will need to configure your Elasticsearch server to allow dynamic (inline) Groovy scripts.
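On the 1.x line that typically meant something like the following in elasticsearch.yml (the exact setting names changed between minor versions, so check the documentation for your release):
script.disable_dynamic: false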
Here's the code for the Groovy script filter:
def label_map = labels.collectEntries { entry -> [entry, 1] };
def count = 0;
for (def label : doc['label'].values) {
if (!label_map.containsKey(label)) {
return 0
} else {
count += 1
}
};
return count
To use it in an Elasticsearch query, you either need to escape all the newline characters, or place the script on one line like this:
def label_map = labels.collectEntries { entry -> [entry, 1] }; def count = 0; for (def label : doc['label'].values) { if (!label_map.containsKey(label)) { return 0 } else { count += 1 } }; return count
Here's an Elasticsearch query that's very similar to what I did, including the script filter:
POST /documents/_search
{
"fields": [
"id",
"name",
"label",
"description"
],
"query": {
"function_score": {
"query": {
"filtered": {
"query": {
"bool": {
"minimum_should_match": 1,
"should" : {
"term" : {
"description" : "fine"
}
}
}
},
"filter": {
"script": {
"script": "def label_map = labels.collectEntries { entry -> [entry, 1] }; def count = 0; for (def label : doc['label'].values) { if (!label_map.containsKey(label)) { return 0 } else { count += 1 } }; return count",
"lang": "groovy",
"params": {
"labels": [
"foo",
"bar",
"qux",
"zip",
"baz"
]
}
}
}
}
},
"functions": [
{
"filter": {
"query": {
"match": {
"label": "qux"
}
}
},
"boost_factor": 25
}
],
"score_mode": "multiply"
}
},
"size": 10
}
My actual query required combining the script filter with a function score query, which was hard to figure out how to do, so I'm including it here as an example.
What this does is use the script filter to select documents whose labels are a subset of the labels passed in the query. For my use case (thousands of documents, not millions) this works very quickly - tens of milliseconds.
The first time the script is used, it takes a long time (about 1000 ms), probably due to compilation and caching. But later invocations are 100 times faster.
A couple of notes:
I used the Sense console Chrome plugin to debug the Elasticsearch query. Much better than using curl on the command line! (Note that Sense is now part of Marvel, so you can also get it there.)
To implement the Groovy script, I first installed the Groovy language on my laptop, wrote some unit tests, and implemented the script. Once I was sure the script was working, I formatted it to fit on one line and put it into Sense.
You can use a script filter to check whether the terms array contains all the values of the label array in a document. I suggest you put the script in a separate Groovy file (or plain JavaScript file) under config/scripts/folderToYourScript, and use it in your query in a filter: { script: { script_file: file } }.
In the script file you can use a loop to check the requirement.
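A rough sketch of what that could look like, assuming the Groovy script from the accepted answer is saved as config/scripts/label_subset.groovy (the file name is made up, and this is untested against a real 1.x cluster):
POST /documents/_search
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "script": {
          "script_file": "label_subset",
          "lang": "groovy",
          "params": {
            "labels": [ "foo", "bar", "qux", "zip", "baz" ]
          }
        }
      }
    }
  }
}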