Get group by and distinct count of values using other field in Elasticsearch - elasticsearch

I have an index having document structure as below -
{
"key": ["10", "20"],
"keywords": [
{
'case': 1,
'word': 'abc'
},
{
'case': 2,
'word': 'def'
},
{
'case': 1
'word': abcd
}
]
}
I need to apply filter on key=10 & get the count of distinct words by each case accros the documents. There are 20 disinct cases, so this query will return is 20 buckets at max.
Filter Condition:
key = 10
Expected result set
[
{
'case': 1,
'value': 2
},
{
'case': 2,
'value': 1
}
]
Equivalent SQL Query for this is -
select case, count(distinct words) as value
from <table> where key = 10 and case in (1, 2, 3, 4) group by case;

First map the nested structure as nested datatype in ES index.
Mapping reference here.
{
"query": {
"match": {
"key": "123"
}
},
"aggs": {
"keywords": {
"nested": {
"path": "keywords"
},
"aggs": {
"subjects": {
"terms": {
"field": "keywords.case.keyword"
},
"aggs": {
"count": {
"cardinality": {
"field": "keywords.word.keyword",
"precision_threshold": 4000
}
}
}
}
}
}
},
"size": 0
}

Related

Find all documnents with max value in field

I am new in elasticsearch, it is my first NoSql DB and I have some problem.
I have something like this:
index_logs: [
{
"entity_id": id_1,
"field_name": "name",
"old_value": null
"new_value": "some_name"
"number": 1
},
{
"entity_id": id_1,
"field_name": "description",
"old_value": null
"new_value": "some_descr"
"number": 1
},
{
"entity_id": id_1,
"field_name": "description",
"old_value": "some_descr"
"new_value": null
"number": 2
},
{
"entity_id": id_2,
"field_name": "enabled",
"old_value": true
"new_value":false
"number": 25
},
]
And I need to find all documents with specified entity_id and max_number which I do not know.
In postgres it will be as following code:
SELECT *
FROM logs AS x
WHERE x.entity_id = <some_entity_id>
AND x.number = (SELECT MAX(y.number) FROM logs AS y WHERE y.entity_id = x.entity_id)
How can I do it in elasticsearch?
Use the following query:
{
"query": {
"term": {
"entity_id": {
"value": "1"
}
}
},
"aggs": {
"max_number": {
"terms": {
"field": "number",
"size": 1,
"order": {
"_key": "desc"
}
},
"aggs": {
"top_hits": {
"top_hits": {
"size": 10
}
}
}
}
}
}
The aggregation will group by the "number", sort by descending order of number, then give you the top 10 results with that number, inside the 'top_hits' subaggregation.
Another way to get "All" the documents is to simply use the same query, withouty any aggregations, and sort descending on the "number" field. On the client side you use pagination with "search_after", and paginate all the results, till the "number" field changes. The first time the number changes, you exit your pagination loop and you have all the records with the given entity id and the max number.

Nested array of objects aggregation in Elasticsearch

Documents in the Elasticsearch are indexed as such
Document 1
{
"task_completed": 10
"tagged_object": [
{
"category": "cat",
"count": 10
},
{
"category": "cars",
"count": 20
}
]
}
Document 2
{
"task_completed": 50
"tagged_object": [
{
"category": "cars",
"count": 100
},
{
"category": "dog",
"count": 5
}
]
}
As you can see that the value of the category key is dynamic in nature. I want to perform a similar aggregation like in SQL with the group by category and return the sum of the count of each category.
In the above example, the aggregation should return
cat: 10,
cars: 120 and
dog: 5
Wanted to know how to write this aggregation query in Elasticsearch if it is possible. Thanks in advance.
You can achieve your required result, using nested, terms, and sum aggregation.
Adding a working example with index mapping, search query and search result
Index Mapping:
{
"mappings": {
"properties": {
"tagged_object": {
"type": "nested"
}
}
}
}
Search Query:
{
"size": 0,
"aggs": {
"resellers": {
"nested": {
"path": "tagged_object"
},
"aggs": {
"books": {
"terms": {
"field": "tagged_object.category.keyword"
},
"aggs":{
"sum_of_count":{
"sum":{
"field":"tagged_object.count"
}
}
}
}
}
}
}
}
Search Result:
"aggregations": {
"resellers": {
"doc_count": 4,
"books": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "cars",
"doc_count": 2,
"sum_of_count": {
"value": 120.0
}
},
{
"key": "cat",
"doc_count": 1,
"sum_of_count": {
"value": 10.0
}
},
{
"key": "dog",
"doc_count": 1,
"sum_of_count": {
"value": 5.0
}
}
]
}
}
}

ElasticSearch cardinality aggregation with multiple query

I have a document with merchant and item. my document will look liken
{
"merchant": "M1",
"item": "I1"
}
For the given list of merchant names, I want to get number of unique items on each merchant.
I was able to get number of unique items on a given merchant by following query:
{
"size": 0,
"query": {
"match": {
"merchant": "M1"
}
},
"aggs": {
"count_unique_items": {
"cardinality": {
"field": "I1"
}
}
}
}
Is there a way to expand this query so instead of 1 merchant, I can do search for N merchants with one query?
You need to use terms query to match multiple merchants and use multilevel aggregation to find unique count per merchant. So create a terms aggregation for merchant and then add cardinality aggregation as sub aggregation to the terms aggregation. Query will look like below:
{
"size": 0,
"query": {
"terms": {
"merchant": [
"M1",
"M2"
]
}
},
"aggs": {
"merchent": {
"terms": {
"field": "merchant"
},
"aggs": {
"item_count": {
"cardinality": {
"field": "item"
}
}
}
}
}
}
As suggested by #Opster ES Ninja Nishant, you need to use multilevel aggregation.
Adding a working example with index data,search query, and search result
Index Data:
{
"merchant": "M3",
"item": ["I3","I2"]
}
{
"merchant": "M2",
"item": ["I2","I2"]
}
{
"merchant": "M1",
"item": "I1"
}
Search Query:
To count the unique number of item for a given merchant, in the cardinality aggregation instead of I1, you should use the item field
{
"size":0,
"query": {
"terms": {
"merchant.keyword": [
"M1",
"M2",
"M3"
]
}
},
"aggs": {
"merchent": {
"terms": {
"field": "merchant.keyword"
},
"aggs": {
"item_count": {
"cardinality": {
"field": "item.keyword" <-- note this
}
}
}
}
}
}
Search Result:
"aggregations": {
"merchent": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "M1",
"doc_count": 1,
"item_count": {
"value": 1
}
},
{
"key": "M2",
"doc_count": 1,
"item_count": {
"value": 1
}
},
{
"key": "M3",
"doc_count": 1,
"item_count": {
"value": 2
}
}
]
}

Nested Aggregation in Nest Elastic Search

In my Elastic document i have CityId,RootId,RootName,Price.Now i have to find top 7 roots in a city with following conditions.
Name and id of root which has minimum price in a City.
top 7 roots:- roots those have max number of entry in a City.
for Example :-
CityId RootId RootName Price
11 1 ABC 90
11 1 ABC 100
11 2 DEF 80
11 2 DEF 90
11 2 DEF 60
answer for CityId =11:-
RootId RootName Price
2 DEF 60
1 ABC 90
I am not aware of the syntax of the Nest. Adding a working example in JSON format.
Index Mapping:
{
"mappings":{
"properties":{
"listItems":{
"type":"nested"
}
}
}
}
Index Data:
{
"RootId": 2,
"CityId": 11,
"RootName": "DEF",
"listItems": [
{
"Price": 60
},
{
"Price": 90
},
{
"Price": 80
}
]
}
{
"RootId": 1,
"CityId": 11,
"RootName": "ABC",
"listItems": [
{
"Price": 100
},
{
"Price": 90
}
]
}
Search Query:
{
"size": 0,
"aggs": {
"id_terms": {
"terms": {
"field": "RootId"
},
"aggs": {
"nested_entries": {
"nested": {
"path": "listItems"
},
"aggs": {
"min_position": {
"min": {
"field": "listItems.Price"
}
}
}
}
}
}
}
}
Search Result:
"aggregations": {
"id_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1,
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 90.0
}
}
},
{
"key": 2,
"doc_count": 1,
"nested_entries": {
"doc_count": 3,
"min_position": {
"value": 60.0
}
}
}
]
}
}
.Query(query => query.Bool(bQuery => bQuery.Filter(
fQuery => fQuery.Terms(ter => ter.Field(f => f.CityId).Terms(cityId))
)))
.Aggregations(agg => agg.Terms("group_by_rootId", st => st.Field(o => o.RootId)
.Order(TermsOrder.CountDescending)
.Aggregations(childAgg => childAgg.Min("min_price_in_group", m =>m.Field(p=>p.Price))
.TopHits("stocks", t11 => t11
.Source(sfd => sfd.Includes(fd => fd.Fields(Constants.IncludedFieldsFromElastic)))
.Size(1)
)
)
)
)
.Size(_popularStocksCount)
.From(0)
.Take(0);

Elasticsearch aggregation over children document field values

I'm facing the following problem of selecting and sorting parent documents based on an aggregated value over its children documents. The aggregation (e.g. sum) itself depends on a query string, i.e. which children documents are relevant for the aggregation.
Example: Given the documents basket A and basket B, for each basket document, I am looking to sum over the number field of its fruit children if the name field matches my query, e.g. apples.
PUT /baskets/_doc/0
{
"name": "basket A",
"fruit": [
{
"name": "apples",
"number": 2
},
{
"name": "oranges",
"number": 3
}
]
}
PUT /baskets/_doc/1
{
"name": "basket B",
"fruit": [
{
"name": "apples",
"number": 3
},
{
"name": "apples",
"number": 3
}
]
}
Mappings:
PUT /baskets
{
"mappings": {
"properties": {
"name": { "type": "text" },
"fruit": {
"type": "nested",
"properties": {
"name": { "type": "text" },
"number": { "type": "long" }
}
}
}
}
}
Use case 1: Which basket has (strictly) more than 5 apples? Would expect only basket B
Use case 2: Sort baskets by number of apples. Would expect basket B with a total of 6 apples, then basket A with a total of 2 apples.
How can one implement this using the Elasticsearch (7.8.0) query DSL?
I have tried so far with nested queries and aggregations without success.
Thanks!
Edit: Added mappings
Edit: Updated the numbers to better reflect the problem
*Edit: Added possible answer to Use case 2 (see comments to the answer from #joe):
GET /profiles/_search
{
"aggs": {
"aggs_baskets": {
"terms": {
"field": "name",
"order": {"nest > fruit_filter > fruit_sum": "desc"}
},
"aggs": {
"nest":{
"nested":{
"path": "fruit"
},
"aggs":{
"fruit_filter":{
"filter": {
"term": {"fruit.name": "apple"}
},
"aggs":{
"fruit_sum":{
"sum": {"field": "fruit.number"}
}
}
}
}
}
}
}
}
}
Use case 1:
GET baskets/_search
{
"query": {
"nested": {
"path": "fruit",
"inner_hits": {},
"query": {
"bool": {
"must": [
{
"term": {
"fruit.name": {
"value": "apples"
}
}
},
{
"range": {
"fruit.number": {
"gte": 5
}
}
}
]
}
}
}
}
}
Strictly more than 5 --> gt; >=5 --> gte.
Also notice the inner_hits part -- this gives you the actual nested subdocument which caused this particular basket to match the query. It's not required but good-to-know.
Use case 2:
GET baskets/_search
{
"sort": [
{
"fruit.number": {
"nested_path": "fruit",
"order": "desc"
}
}
]
}
Use case 2 Edit:
There are probably cleaner ways of doing this but I'd go with the following:
GET baskets/_search
{
"size": 0,
"aggs": {
"multiply_and_add": {
"scripted_metric": {
"params": {
"only_fruit_name": "apples"
},
"init_script": "state.by_basket_name = [:]",
"map_script": """
def basket_name = params._source['name'];
def fruits = params._source['fruit'].findAll(group -> group.name == params.only_fruit_name);
for (def fruit_group : fruits) {
def number = fruit_group.number;
if (state.by_basket_name.containsKey(basket_name)) {
state.by_basket_name[basket_name] += number;
} else {
state.by_basket_name[basket_name] = number;
}
}
""",
"combine_script": "return state.by_basket_name",
"reduce_script": "return states"
}
}
}
}
yielding a hash map along the lines of
{
...
"aggregations":{
"multiply_and_add":{
"value":[
{
"basket A":2,
"basket B":6
}
]
}
}
}
Sorting can either be done in the reduce_script or within your ES response post-processing pipeline. You could of course choose to go w/ (sorted) lists and lambdas...
Notice the required nested_path.
After a while of searching and testing, here are (in addition to #joe's answer to use case 2) possible queries for both use cases. Note that both use cases require to change the mapping for the field name to be of type keyword.
Use case 1: Which basket has (strictly) more than 5 apples? Would expect only basket B
For more information on filtering results by their aggregation value see Bucket Selectors
GET /baskets/_search
{
"aggs": {
"aggs_baskets": {
"terms": {
"field": "name"
},
"aggs": {
"nest":{
"nested":{
"path": "fruit"
},
"aggs":{
"fruit_filter":{
"filter": {
"match": {"fruit.name": "apples"}
},
"aggs":{
"fruit_sum":{
"sum": {"field": "fruit.number"}
}
}
}
}
},
"basket_sum_filter":{
"bucket_selector":{
"buckets_path":{
"fruitSum":"nest > fruit_filter > fruit_sum"
},
"script":"params.fruitSum > 5"
}
}
}
}
}
}
... yielding
...,
"buckets": [
{
"key": "basket B",
"doc_count": 1,
"nest": {
"doc_count": 2,
"fruit_filter": {
"doc_count": 2,
"fruit_sum": {
"value": 6
}
}
}
}
]
Use case 2: Sort baskets by number of apples. Would expect basket B with a total of 6 apples, then basket A with a total of 2 apples.
GET /baskets/_search
{
"aggs": {
"aggs_baskets": {
"terms": {
"field": "name",
"order": {"nest > fruit_filter > fruit_sum": "desc"}
},
"aggs": {
"nest":{
"nested":{
"path": "fruit"
},
"aggs":{
"fruit_filter":{
"filter": {
"term": {"fruit.name": "apple"}
},
"aggs":{
"fruit_sum":{
"sum": {"field": "fruit.number"}
}
}
}
}
}
}
}
}
}
... yielding
...,
"buckets": [
{
"key": "basket B",
"doc_count": 1,
"nest": {
"doc_count": 2,
"fruit_filter": {
"doc_count": 2,
"fruit_sum": {
"value": 6
}
}
}
},
{
"key": "basket A",
"doc_count": 1,
"nest": {
"doc_count": 2,
"fruit_filter": {
"doc_count": 1,
"fruit_sum": {
"value": 2
}
}
}
}
]

Resources