Range filter for count of documents with the same value for a field - elasticsearch

In my index my-books, each document represents a book and has a field authorId, which uniquely represents the author of the book. I want to run a search query with a range filter on the total number of books authored by the book's author.
For example, say I have four authors A, B, C, D.
A is the author of books a1, a2, a3.
B is the author of book b1.
C is the author of books c1, c2.
D is the author of books d1, d2, d3, d4.
Let's say I want to retrieve all books for which the number of books written by the same author is greater than 1 but less than 4. Then my result hits are [a1, a2, a3, c1, c2].
How do I write such a query?

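You need to combine three aggregations:
a terms aggregation to group by author
a top_hits sub-aggregation to get the documents under each author
a bucket_selector to keep only the author buckets whose doc count is in range
The script below only applies the upper bound (doc count less than 4). To also require more than 1 book, as in the question, the script would be "params.values > 1 && params.values < 4". The example groups on author.keyword; substitute your own field (authorId in the question).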
{
"aggs": {
"NAME": {
"terms": {
"field": "author.keyword",
"size": 10
},
"aggs": {
"books": {
"top_hits": {
"size": 10
}
},
"final_filter": {
"bucket_selector": {
"buckets_path": {
"values": "_count"
},
"script": "params.values < 4"
}
}
}
}
}
}
Result
"aggregations" : {
"NAME" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "A",
"doc_count" : 2,
"books" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "index148",
"_type" : "_doc",
"_id" : "-_pOUHoBVZyA6L_G1XrM",
"_score" : 1.0,
"_source" : {
"book" : "a1",
"author" : "A"
}
},
{
"_index" : "index148",
"_type" : "_doc",
"_id" : "_PpPUHoBVZyA6L_GL3q5",
"_score" : 1.0,
"_source" : {
"book" : "a3",
"author" : "A"
}
}
]
}
}
},
{
"key" : "B",
"doc_count" : 1,
"books" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "index148",
"_type" : "_doc",
"_id" : "_fpPUHoBVZyA6L_GWHpg",
"_score" : 1.0,
"_source" : {
"book" : "b1",
"author" : "B"
}
}
]
}
}
},
{
"key" : "C",
"doc_count" : 1,
"books" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "index148",
"_type" : "_doc",
"_id" : "_vpPUHoBVZyA6L_Gmnoj",
"_score" : 1.0,
"_source" : {
"book" : "c1",
"author" : "C"
}
}
]
}
}
}
]
}
}
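For the range in the question (more than 1 but fewer than 4 books), both bounds have to go into the bucket_selector script. The Python sketch below builds such a request body and emulates the bucket filtering locally on the sample authors from the question. The function names, the `by_author` and `count_filter` aggregation names, and the use of `authorId` as the grouping field are assumptions for illustration; nothing here talks to a cluster.

```python
from collections import defaultdict


def build_request(field, lo, hi, size=10):
    """Request body: terms -> top_hits + bucket_selector with both (exclusive) bounds."""
    return {
        "size": 0,
        "aggs": {
            "by_author": {  # hypothetical aggregation name
                "terms": {"field": field, "size": size},
                "aggs": {
                    "books": {"top_hits": {"size": size}},
                    "count_filter": {  # hypothetical aggregation name
                        "bucket_selector": {
                            "buckets_path": {"count": "_count"},
                            "script": f"params.count > {lo} && params.count < {hi}",
                        }
                    },
                },
            }
        },
    }


def books_in_range(books, lo, hi):
    """Local emulation: keep books whose author wrote more than lo and fewer than hi books."""
    by_author = defaultdict(list)
    for book, author in books:
        by_author[author].append(book)
    kept = []
    for author in sorted(by_author):
        if lo < len(by_author[author]) < hi:
            kept.extend(by_author[author])
    return kept


# Sample data from the question: (book, author) pairs.
SAMPLE = [("a1", "A"), ("a2", "A"), ("a3", "A"), ("b1", "B"),
          ("c1", "C"), ("c2", "C"),
          ("d1", "D"), ("d2", "D"), ("d3", "D"), ("d4", "D")]

print(books_in_range(SAMPLE, 1, 4))  # ['a1', 'a2', 'a3', 'c1', 'c2']
```

Note that the terms `size` caps the number of author buckets and the top_hits `size` caps the books returned per author, so both would need raising on a larger index.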

Related

Elasticsearch returning wrong results upon query

I am new to Elasticsearch and was doing some experiments to learn, but I found that the _search query is returning wrong results. I inserted documents into the index using the following code
PUT tryDB/_doc/2
{"personId":"2","minor":true,"money":15 }
PUT tryDB/_doc/3
{"personId":"3","minor":true,"money":20 }
PUT tryDB/_doc/4
{"personId":"4","minor":true,"money":25 }
PUT tryDB/_doc/5
{"personId":"5","minor":true,"money":30 }
PUT tryDB/_doc/6
{"personId":"6","minor":true,"money":35 }
PUT tryDB/_doc/7
{"personId":"7","minor":true,"money":40 }
PUT tryDB/_doc/8
{"personId":"8","minor":true,"money":45 }
PUT tryDB/_doc/9
{"personId":"9","minor":true,"money":55 }
PUT tryDB/_doc/10
{"personId":"10","minor":true,"money":60 }
PUT tryDB/_doc/11
{"personId":"11","minor":true,"money":65 }
PUT tryDB/_doc/12
{"personId":"12","minor":true,"money":70 }
PUT tryDB/_doc/13
{"personId":"2","minor":false,"money":80 }
PUT tryDB/_doc/14
{"personId":"2","minor":false,"money":90 }
PUT tryDB/_doc/15
{"personId":"2","minor":false,"money":100 }
PUT tryDB/_doc/16
{"personId":"2","minor":false,"money":10 }
After which I fired up a GET tryDB/_search query to list all the documents, which in turn returns
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 16,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "tryDB",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"personId" : "1",
"minor" : true,
"money" : 10
}
},
{
"_index" : "tryDB",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"personId" : "2",
"minor" : true,
"money" : 15
}
},
{
"_index" : "tryDB",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"personId" : "3",
"minor" : true,
"money" : 20
}
},
{
"_index" : "tryDB",
"_id" : "4",
"_score" : 1.0,
"_source" : {
"personId" : "4",
"minor" : true,
"money" : 25
}
},
{
"_index" : "tryDB",
"_id" : "5",
"_score" : 1.0,
"_source" : {
"personId" : "5",
"minor" : true,
"money" : 30
}
},
{
"_index" : "tryDB",
"_id" : "6",
"_score" : 1.0,
"_source" : {
"personId" : "6",
"minor" : true,
"money" : 35
}
},
{
"_index" : "tryDB",
"_id" : "7",
"_score" : 1.0,
"_source" : {
"personId" : "7",
"minor" : true,
"money" : 40
}
},
{
"_index" : "tryDB",
"_id" : "8",
"_score" : 1.0,
"_source" : {
"personId" : "8",
"minor" : true,
"money" : 45
}
},
{
"_index" : "tryDB",
"_id" : "9",
"_score" : 1.0,
"_source" : {
"personId" : "9",
"minor" : true,
"money" : 55
}
},
{
"_index" : "tryDB",
"_id" : "10",
"_score" : 1.0,
"_source" : {
"personId" : "10",
"minor" : true,
"money" : 60
}
}
]
}
}
Where are the remaining 6 documents?
Next I ran a range-based query
GET tryDB/_search
{
"query": {
"range": {
"money": {
"lte":100
}
}
}
}
Which in turn returned
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "tryDB",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"personId" : "1",
"minor" : true,
"money" : 10
}
},
{
"_index" : "tryDB",
"_id" : "15",
"_score" : 1.0,
"_source" : {
"personId" : "2",
"minor" : false,
"money" : 100
}
},
{
"_index" : "tryDB",
"_id" : "16",
"_score" : 1.0,
"_source" : {
"personId" : "2",
"minor" : false,
"money" : 10
}
}
]
}
}
Which is clearly wrong. Can anyone help me figure out what's going on here?
Where are the rest 6 documents ?
When you do not specify a value for "size", Elasticsearch returns 10 documents by default.
Set size like this:
{
"size": 20,
"query": {
"match_all": {}
}
}
POST tryDB/_search
{
"query": {
"bool": {
"filter": [
{
"range": {
"money": {
"lte": 100
}
}
}
]
}
}
}
@rabbitbr Thanks for the quick response!
I figured out the solution (posting it here).
Based on the result, it looks like Elasticsearch indexed money as a string.
I set up an explicit mapping to make sure the money field is indexed as a number.
https://opensearch.org/docs/1.3/opensearch/mappings/
This worked.
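The three hits returned by the range query are consistent with money being compared as a string: lexicographically, "10" and "100" sort at or before "100", while "15", "20" and so on sort after it. The Python sketch below reproduces that comparison on the values from the question and shows what an explicit mapping body might look like. The `integer` type and the mapping body are assumptions, since the post does not show the mapping that was actually used.

```python
# money values of the 16 documents in the question (ids 1..16)
MONEY = [10, 15, 20, 25, 30, 35, 40, 45, 55, 60, 65, 70, 80, 90, 100, 10]


def lte_as_strings(values, bound):
    """Emulate `lte` on a string-typed field: lexicographic comparison."""
    return [v for v in values if str(v) <= str(bound)]


def lte_as_numbers(values, bound):
    """Emulate `lte` on a numeric field: numeric comparison."""
    return [v for v in values if v <= bound]


# Hypothetical explicit mapping that forces money to be numeric.
EXPLICIT_MAPPING = {
    "mappings": {
        "properties": {
            "money": {"type": "integer"}
        }
    }
}

print(lte_as_strings(MONEY, 100))       # [10, 100, 10]
print(len(lte_as_numbers(MONEY, 100)))  # 16
```

The type of an existing field cannot be changed in place, so the mapping would have to be applied when creating a new index, with the documents indexed again afterwards.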

On Elasticsearch, how to aggregate based on the number of items in a field?

On Elasticsearch I have a field named Itinerary that can contain multiple values (from 1 up to 6); for example, in the snippet below there are 2 items in the field.
"Itinerary": [
{
"Carrier": "LH",
"Departure": "2021-07-04T06:55:00Z",
"Number": "1493",
"Arrival": "2021-07-04T08:40:00Z"
},
{
"Carrier": "LH",
"Departure": "2021-07-04T13:30:00Z",
"Number": "422",
"Arrival": "2021-07-04T16:05:00Z"
}
]
Is there a way I can aggregate based on the number of items in the field? Having something like:
1 item : 2
2 item : 4
...
The Itinerary field needs to be defined as a nested type:
"Itinerary":
{
"type": "nested"
}
Use a terms aggregation to group on a field. You can use a script to get the length of the array or, better, introduce a field that stores the array length.
Use a top_hits aggregation to get the documents under each group.
{
"size": 0,
"aggs": {
"NAME": {
"terms": {
"script": {
"source": "doc['Itinerary.Carrier.keyword'].length"
}
},
"aggs": {
"NAME": {
"top_hits": {
"size": 10
}
}
}
}
}
}
Result:
"aggregations" : {
"NAME" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 2,
"doc_count" : 2,
"NAME" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "index8",
"_type" : "_doc",
"_id" : "8OW1lnsBRh1xpgSkIOlq",
"_score" : 1.0,
"_source" : {
"Itinerary" : [
{
"Carrier" : "LH",
"Departure" : "2021-07-04T06:55:00Z",
"Number" : "1493",
"Arrival" : "2021-07-04T08:40:00Z"
},
{
"Carrier" : "LH",
"Departure" : "2021-07-04T13:30:00Z",
"Number" : "422",
"Arrival" : "2021-07-04T16:05:00Z"
}
]
}
},
{
"_index" : "index8",
"_type" : "_doc",
"_id" : "8uW6lnsBRh1xpgSkAun1",
"_score" : 1.0,
"_source" : {
"Itinerary" : [
{
"Carrier" : "LH2",
"Departure" : "2021-07-04T06:55:00Z",
"Number" : "14931",
"Arrival" : "2021-07-04T08:40:00Z"
},
{
"Carrier" : "LH2",
"Departure" : "2021-07-04T13:30:00Z",
"Number" : "4221",
"Arrival" : "2021-07-04T16:05:00Z"
}
]
}
}
]
}
}
},
{
"key" : 3,
"doc_count" : 1,
"NAME" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "index8",
"_type" : "_doc",
"_id" : "8eW1lnsBRh1xpgSkdukQ",
"_score" : 1.0,
"_source" : {
"Itinerary" : [
{
"Carrier" : "LH1",
"Departure" : "2021-07-04T06:55:00Z",
"Number" : "14931",
"Arrival" : "2021-07-04T08:40:00Z"
},
{
"Carrier" : "LH1",
"Departure" : "2021-07-04T13:30:00Z",
"Number" : "4221",
"Arrival" : "2021-07-04T16:05:00Z"
},
{
"Carrier" : "LH1",
"Departure" : "2021-07-04T13:30:00Z",
"Number" : "3221",
"Arrival" : "2021-07-04T16:05:00Z"
}
]
}
}
]
}
}
}
]
}
}
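The suggestion to "introduce a field which has count of array" could look like the Python sketch below, which stamps each document with its number of itinerary items before indexing and then tallies the documents per count, the way a terms aggregation on that field would. The field name `ItineraryCount` and both helper names are hypothetical.

```python
from collections import Counter


def with_itinerary_count(doc):
    """Return a copy of doc with a precomputed ItineraryCount field (hypothetical name)."""
    enriched = dict(doc)
    enriched["ItineraryCount"] = len(doc.get("Itinerary", []))
    return enriched


def count_histogram(docs):
    """Emulate a terms aggregation on ItineraryCount: {item count: number of docs}."""
    return dict(Counter(d["ItineraryCount"] for d in docs))


# Trimmed versions of the three documents shown in the result above.
docs = [
    {"Itinerary": [{"Carrier": "LH"}, {"Carrier": "LH"}]},
    {"Itinerary": [{"Carrier": "LH2"}, {"Carrier": "LH2"}]},
    {"Itinerary": [{"Carrier": "LH1"}, {"Carrier": "LH1"}, {"Carrier": "LH1"}]},
]
enriched = [with_itinerary_count(d) for d in docs]
print(count_histogram(enriched))  # {2: 2, 3: 1}
```

With such a field in place, the request would reduce to a plain terms aggregation on `ItineraryCount`, with no script needed at query time.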

is there a way of showing documents after a sum aggregation?

I've been trying lately to retrieve information about sales on Kibana DSL.
I've been told to show vendors information PLUS their monthly sales.
(I'll use the "Kibana_sample_data_ecommerce" for this example)
I already did this aggregation in order to group all clients by their 'customer_id':
#Aggregations (group by)
GET kibana_sample_data_ecommerce/_search
{
"size": 0,
"aggs": {
"by user_id": {
"terms": {
"field": "customer_id"
},
"aggs": {
"add_field_to_bucket": {
"top_hits": {"size": 1, "_source": {"includes": ["customer_full_name"]}}
}
}
}
}
}
in which I've included customer_full_name in the result:
"aggregations" : {
"by user_id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 2970,
"buckets" : [
{
"key" : "27",
"doc_count" : 348,
"add_field_to_bucket" : {
"hits" : {
"total" : 348,
"max_score" : 1.0,
"hits" : [
{
"_index" : "kibana_sample_data_ecommerce",
"_type" : "_doc",
"_id" : "fhwUR3sBpfDKGuVlpu8r",
"_score" : 1.0,
"_source" : {
"customer_full_name" : "Elyssa Underwood"
}
}
]
}
}
}
So, from this result I know that 'Elyssa Underwood' with 'customer_id' '27' has 348 hits (related documents).
I also need to know the total spent by 'Elyssa' on those products, using the field 'products.taxful_price'.
The thing is that I cannot run a sub-aggregation on top_hits (as far as I know). I've also tried a sum aggregation, but I end up in the same place: I get my sum, but I cannot access the top_hits sub-aggregation at that point.
At the end of the day i want to have a result like this:
"hits" : [
{
"_index" : "kibana_sample_data_ecommerce",
"_type" : "_doc",
"_id" : "fhwUR3sBpfDKGuVlpu8r",
"_score" : 1.0,
"_source" : {
"customer_full_name" : "Elyssa Underwood",
"total_spent": 1234.5678
}
}
]
Is there something I can do to achieve this?
PS: I'm using Elasticsearch 5.x and I also have access to the NEST client, if there's a solution I can reach through it.
Thanks In Advance.
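I used the following sample data. The sum and top_hits aggregations sit side by side under the same terms bucket, so each bucket returns both the total and the documents.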
Data:
{
"customer_id":2,
"client-name":"b",
"purchase": 2001
}
Query:
GET index/_search
{
"size": 0,
"aggs": {
"NAME": {
"terms": {
"field": "customer_id",
"size": 10
},
"aggs": {
"total_sales": {
"sum": {
"field": "purchase"
}
},
"documents":{
"top_hits": {
"size": 10
}
}
}
}
}
}
Result:
{
"key" : 2,
"doc_count" : 1,
"documents" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "index1",
"_type" : "_doc",
"_id" : "0HPzcHsBjw4ziwrzGzrq",
"_score" : 1.0,
"_source" : {
"customer_id" : 2,
"client-name" : "b",
"purchase" : 2001
}
}
]
}
},
"total_sales" : {
"value" : 2001.0
}
}
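The response keeps the sum and the documents in separate places inside each bucket, while the question asks for one row per customer with the name and the total together. One way to get that shape is to merge them on the client side, as in the Python sketch below. The function name and the `total_spent` key are illustrative, and the bucket layout is assumed to match the result shown above.

```python
def flatten_buckets(buckets, name_field="client-name"):
    """Merge each bucket's first top_hits document with its sum into one row."""
    rows = []
    for bucket in buckets:
        first_hit = bucket["documents"]["hits"]["hits"][0]["_source"]
        rows.append({
            "customer_id": bucket["key"],
            name_field: first_hit[name_field],
            "total_spent": bucket["total_sales"]["value"],
        })
    return rows


# One bucket, trimmed from the result above.
sample_buckets = [{
    "key": 2,
    "doc_count": 1,
    "documents": {"hits": {"hits": [
        {"_source": {"customer_id": 2, "client-name": "b", "purchase": 2001}}
    ]}},
    "total_sales": {"value": 2001.0},
}]

print(flatten_buckets(sample_buckets))
# [{'customer_id': 2, 'client-name': 'b', 'total_spent': 2001.0}]
```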

Search documents with highest fields

I'm trying to get all the documents with highest field value (+ conditional term filter)
Given the Employees mapping
Name     Department   Salary
----------------------------
Tomcat   Dev          100
Bobcat   QA           90
Beast    QA           100
Tom      Dev          100
Bob      Dev          90
In SQL it would look like
select * from Employees where Salary = (select max(salary) from Employees)
expected output
Name     Department   Salary
----------------------------
Tomcat   Dev          100
Beast    QA           100
Tom      Dev          100
and
select * from Employees where Salary = (select max(salary) from Employees where Department ='Dev' )
expected output
Name     Department   Salary
----------------------------
Tomcat   Dev          100
Tom      Dev          100
Is it possible with Elasticsearch ?
The following should help.
Based on your data, I came up with this mapping:
Mapping:
PUT my-salary-index
{
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"department":{
"type": "keyword"
},
"salary":{
"type": "float"
}
}
}
}
Sample Documents:
POST my-salary-index/_doc/1
{
"name": "Tomcat",
"department": "Dev",
"salary": 100
}
POST my-salary-index/_doc/2
{
"name": "Bobcat",
"department": "QA",
"salary": 90
}
POST my-salary-index/_doc/3
{
"name": "Beast",
"department": "QA",
"salary": 100
}
POST my-salary-index/_doc/4
{
"name": "Tom",
"department": "Dev",
"salary": 100
}
POST my-salary-index/_doc/5
{
"name": "Bob",
"department": "Dev",
"salary": 90
}
Solutions:
Scenario 1: Return all employees with max salary
POST my-salary-index/_search
{
"size": 0,
"aggs": {
"my_employees_salary":{
"terms": {
"field": "salary",
"size": 1, <--- Note this
"order": {
"_key": "desc"
}
},
"aggs": {
"my_employees": {
"top_hits": { <--- Note this. Top hits aggregation
"size": 10
}
}
}
}
}
}
Note that I've used a Terms Aggregation with a Top Hits aggregation chained to it. I'd suggest reading the documentation for both aggregations.
Basically you just need to retrieve the first bucket of the Terms Aggregation, which is why I've set size: 1. Also note the order: switch it to asc if your requirement is to retrieve the lowest salary instead.
Scenario 1 Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_employees_salary" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 2,
"buckets" : [
{
"key" : 100.0,
"doc_count" : 3,
"my_employees" : {
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "my-salary-index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"name" : "Tomcat",
"department" : "Dev",
"salary" : 100
}
},
{
"_index" : "my-salary-index",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"name" : "Beast",
"department" : "QA",
"salary" : 100
}
},
{
"_index" : "my-salary-index",
"_type" : "_doc",
"_id" : "4",
"_score" : 1.0,
"_source" : {
"name" : "Tom",
"department" : "Dev",
"salary" : 100
}
}
]
}
}
}
]
}
}
}
Scenario 2: Return all employees with the max salary from a particular department
POST my-salary-index/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"term": {
"department": "Dev"
}
}
]
}
},
"aggs": {
"my_employees_salary":{
"terms": {
"field": "salary",
"size": 1,
"order": {
"_key": "desc"
}
},
"aggs": {
"my_employees": {
"top_hits": {
"size": 10
}
}
}
}
}
}
There are many ways to do this, but the idea is to filter the documents before applying the aggregation, which is more efficient.
Note that I've just added a bool query to the aggregation request from the Scenario 1 solution.
Scenario 2 Response
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_employees_salary" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 1,
"buckets" : [
{
"key" : 100.0,
"doc_count" : 2,
"my_employees" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.53899646,
"hits" : [
{
"_index" : "my-salary-index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.53899646,
"_source" : {
"name" : "Tomcat",
"department" : "Dev",
"salary" : 100
}
},
{
"_index" : "my-salary-index",
"_type" : "_doc",
"_id" : "4",
"_score" : 0.53899646,
"_source" : {
"name" : "Tom",
"department" : "Dev",
"salary" : 100
}
}
]
}
}
}
]
}
}
}
You could also consider using SQL Access if you have the full, licensed version of X-Pack.
Hope this helps.
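As a cross-check of the expected outputs, the two cases from the question can be emulated in plain Python on the five sample employees. This is only a local sketch of the intended semantics (filter first, then keep every row tied for the maximum, matching the expected output tables in the question), not an Elasticsearch call; the function name is made up.

```python
EMPLOYEES = [
    {"name": "Tomcat", "department": "Dev", "salary": 100},
    {"name": "Bobcat", "department": "QA", "salary": 90},
    {"name": "Beast", "department": "QA", "salary": 100},
    {"name": "Tom", "department": "Dev", "salary": 100},
    {"name": "Bob", "department": "Dev", "salary": 90},
]


def top_earners(employees, department=None):
    """Names of all employees tied for the highest salary, optionally within one department."""
    pool = [e for e in employees
            if department is None or e["department"] == department]
    if not pool:
        return []
    best = max(e["salary"] for e in pool)
    return [e["name"] for e in pool if e["salary"] == best]


print(top_earners(EMPLOYEES))         # ['Tomcat', 'Beast', 'Tom']
print(top_earners(EMPLOYEES, "Dev"))  # ['Tomcat', 'Tom']
```

In the aggregation queries above, the top_hits "size": 10 caps the documents returned per bucket, so it would need to be raised if more than ten employees can share the top salary.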

Using index sorting by default in _search

I am using Elasticsearch 7.6 and the Index Sorting feature, which was introduced in 6.0.
What I want is to run GET /myindice/_search without specifying a sort and get documents ordered by the index sorting settings I specified for my index, NOT in insertion order.
My index, as per the docs:
PUT twitter
{
"settings" : {
"index" : {
"sort.field" : "date",
"sort.order" : "desc"
}
},
"mappings": {
"properties": {
"date": {
"type": "date"
}
}
}
}
PUT twitter/_doc/a
{
"date": "2015-01-01"
}
PUT twitter/_doc/b
{
"date": "2016-01-01"
}
PUT twitter/_doc/c
{
"date": "2017-01-01"
}
My initial thought is that
GET twitter/_search
should return docs c, b and a.
Instead I get the following:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "a",
"_score" : 1.0,
"_source" : {
"date" : "2015-01-01"
}
},
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "b",
"_score" : 1.0,
"_source" : {
"date" : "2016-01-01"
}
},
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "c",
"_score" : 1.0,
"_source" : {
"date" : "2017-01-01"
}
}
]
}
}
The documentation isn't clear on this particular point, and all the queries shown there specify a sort:
https://www.elastic.co/guide/en/elasticsearch/reference/6.0/index-modules-index-sorting.html
Am I required to specify the sort order in the GET query (hence repeating the sort already specified in the index sorting settings)?
Thanks in advance to any diligent soul who can help.
