Elasticsearch metric aggregation: number of elements in array

I want to do a quite involved query/aggregation. I can't see how because I've just started working with ES. The documents I have look something like this:
{
"keyword": "some keyword",
"items": [
{
"name":"my first item",
"item_property_1":"A",
( other properties here )
},
{
"name":"my second item",
"item_property_1":"B",
( other properties here )
},
{
"name":"my third item",
"item_property_1":"A",
( other properties here )
}
]
( other properties... )
},
{
"keyword": "different keyword",
"items": [
{
"name":"cool item",
"item_property_1":"A",
( other properties here )
},
{
"name":"awesome item",
"item_property_1":"C",
( other properties here )
},
]
( other properties... )
},
( other documents... )
Now, what I would like to do is, for each keyword, count how many items there are for each of the several possible values that item_property_1 can have. That is, I want a bucket aggregation that would give the following response:
{
"keyword": "some keyword",
"item_property_1_aggregation": [
{
"key":"A",
"count": 2,
},
{
"key":"B",
"count": 1,
}
]
},
{
"keyword": "different keyword",
"item_property_1_aggregation": [
{
"key":"A",
"count": 1,
},
{
"key":"C",
"count": 1,
}
]
},
( other keywords... )
If mappings are necessary, could you also specify which? I don't have any non-default mappings, I just dumped everything in there.
EDIT:
Saving you the trouble by posting here the bulk PUT for the previous example
PUT /test/test/_bulk
{ "index": {}}
{ "keyword": "some keyword", "items": [ { "name":"my first item", "item_property_1":"A" }, { "name":"my second item", "item_property_1":"B" }, { "name":"my third item", "item_property_1":"A" } ]}
{ "index": {}}
{ "keyword": "different keyword", "items": [ { "name":"cool item", "item_property_1":"A" }, { "name":"awesome item", "item_property_1":"C" } ]}
EDIT2:
I just tried this:
POST /test/test/_search
{
"size":2,
"aggregations": {
"property_1_count": {
"terms":{
"field":"item_property_1"
}
}
}
}
and got this:
"aggregations": {
"property_1_count": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "a",
"doc_count": 2
},
{
"key": "b",
"doc_count": 1
},
{
"key": "c",
"doc_count": 1
}
]
}
}
Close, but no cigar. You can see what's happening: it's bucketing over each item_property_1 irrespective of the keyword it belongs to. I'm sure the solution involves adding the right mapping, but I can't put my finger on it. Suggestions?
EDIT3:
Based on this:
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-nested-type.html
I want to try adding a nested type to property items. To do that, I tried:
PUT /test/_mapping/test
{
"test":{
"properties": {
"items": {
"type": "nested",
"properties": {
"item_property_1":{"type":"string"}
}
}
}
}
}
However, this returns an error:
{
"error": "MergeMappingException[Merge failed with failures {[object mapping [items] can't be changed from non-nested to nested]}]",
"status": 400
}
This might have to do with the warning on that url: "changing an object type to nested type requires reindexing."
So, how do I do that?

Nice tries, you were almost there! Here is what I came up with. Based on your mapping proposal, the mapping I'm using is the following:
curl -XPUT localhost:9200/test/_mapping/test -d '{
"test": {
"properties": {
"keyword": {
"type": "string",
"index": "not_analyzed"
},
"items": {
"type": "nested",
"properties": {
"name": {
"type": "string"
},
"item_property_1": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}'
Note: you need to wipe and reindex your data, since you cannot change a field type from being not nested to nested.
Then I created some data with the bulk query you shared:
curl -XPOST localhost:9200/test/test/_bulk -d '
{ "index": {}}
{ "keyword": "some keyword", "items": [ { "name":"my first item", "item_property_1":"A" }, { "name":"my second item", "item_property_1":"B" }, { "name":"my third item", "item_property_1":"A" } ]}
{ "index": {}}
{ "keyword": "different keyword", "items": [ { "name":"cool item", "item_property_1":"A" }, { "name":"awesome item", "item_property_1":"C" } ]}
'
Finally, here is the aggregation query you can use to get the results you expect. We first bucket by keyword using a terms aggregation and then for each keyword, we bucket by the nested item_property_1 field. Since items is now a nested type, the key is to use a nested aggregation for items and then a terms sub-aggregation for the item_property_1 field.
{
"size": 0,
"aggregations": {
"by_keyword": {
"terms": {
"field": "keyword"
},
"aggs": {
"prop_1_count": {
"nested": {
"path": "items"
},
"aggs": {
"prop_1": {
"terms": {
"field": "items.item_property_1"
}
}
}
}
}
}
}
}
Running that query on your data set will yield this:
{
...
"aggregations" : {
"by_keyword" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "different keyword", <---- keyword 1
"doc_count" : 1,
"prop_1_count" : {
"doc_count" : 2,
"prop_1" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ { <---- buckets for item_property_1
"key" : "A",
"doc_count" : 1
}, {
"key" : "C",
"doc_count" : 1
} ]
}
}
}, {
"key" : "some keyword", <---- keyword 2
"doc_count" : 1,
"prop_1_count" : {
"doc_count" : 3,
"prop_1" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ { <---- buckets for item_property_1
"key" : "A",
"doc_count" : 2
}, {
"key" : "B",
"doc_count" : 1
} ]
}
}
} ]
}
}
}
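To get from this response to the per-keyword shape asked for in the question, the buckets can be flattened on the client. The following is only a minimal Python sketch, assuming the response has already been parsed into a dict; the function name flatten_keyword_buckets is hypothetical and the sample dict is a trimmed copy of the response above.

```python
def flatten_keyword_buckets(response):
    """Map each keyword to {item_property_1 value: count}."""
    result = {}
    for keyword_bucket in response["aggregations"]["by_keyword"]["buckets"]:
        # The nested aggregation wraps the inner terms aggregation.
        inner = keyword_bucket["prop_1_count"]["prop_1"]["buckets"]
        result[keyword_bucket["key"]] = {b["key"]: b["doc_count"] for b in inner}
    return result


# Trimmed copy of the aggregation response shown above.
response = {
    "aggregations": {
        "by_keyword": {
            "buckets": [
                {
                    "key": "different keyword",
                    "doc_count": 1,
                    "prop_1_count": {
                        "doc_count": 2,
                        "prop_1": {"buckets": [
                            {"key": "A", "doc_count": 1},
                            {"key": "C", "doc_count": 1},
                        ]},
                    },
                },
                {
                    "key": "some keyword",
                    "doc_count": 1,
                    "prop_1_count": {
                        "doc_count": 3,
                        "prop_1": {"buckets": [
                            {"key": "A", "doc_count": 2},
                            {"key": "B", "doc_count": 1},
                        ]},
                    },
                },
            ]
        }
    }
}

counts = flatten_keyword_buckets(response)
print(counts)
```

The aggregation names (by_keyword, prop_1_count, prop_1) are the ones used in the query above; with different names the lookups would need to change accordingly.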

Related

Count number of inner elements of array property (Including repeated values)

Given I have the following records.
[
{
"profile": "123",
"inner": [
{
"name": "John"
}
]
},
{
"profile": "456",
"inner": [
{
"name": "John"
},
{
"name": "John"
},
{
"name": "James"
}
]
}
]
I want to get something like:
"aggregations": {
"name": {
"buckets": [
{
"key": "John",
"doc_count": 3
},
{
"key": "James",
"doc_count": 1
}
]
}
}
I'm a beginner using Elasticsearch, and this seems to be a pretty simple operation to do, but I can't find how to achieve this.
If I try a simple aggs using term, it returns 2 for John, instead of 3.
Example request I'm trying:
{
"size": 0,
"aggs": {
"name": {
"terms": {
"field": "inner.name"
}
}
}
}
How can I possibly achieve this?
Additional Info: It will be used on Kibana later.
I can change mapping to whatever I want, but AFAIK Kibana doesn't like the "Nested" type. :(
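You need to add a value_count aggregation. By default, terms only reports a doc_count (the number of documents containing the term), whereas value_count counts the values of a field across the documents in a bucket. Because it counts the inner.name values in those documents regardless of whether they equal the bucket key, the total can differ from the number of occurrences of the key itself (see James in the response below).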
So, for your purposes:
{
"size": 0,
"aggs": {
"name": {
"terms": {
"field": "inner.name"
},
"aggs": {
"total": {
"value_count": {
"field": "inner.name"
}
}
}
}
}
}
Which returns:
"aggregations" : {
"name" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "John",
"doc_count" : 2,
"total" : {
"value" : 3
}
},
{
"key" : "James",
"doc_count" : 1,
"total" : {
"value" : 2
}
}
]
}
}
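The difference between counting documents and counting array elements can be illustrated outside Elasticsearch. This is a minimal Python sketch over the two sample records from the question; it only models the arithmetic and makes no claim about how Elasticsearch computes buckets internally.

```python
from collections import Counter

# The two sample records from the question.
records = [
    {"profile": "123", "inner": [{"name": "John"}]},
    {"profile": "456", "inner": [{"name": "John"}, {"name": "John"}, {"name": "James"}]},
]

# One count per document that contains the name (what a plain terms doc_count reports).
per_document = Counter(
    name
    for rec in records
    for name in {item["name"] for item in rec["inner"]}
)

# One count per array element, repeats included (the result the question asks for).
per_element = Counter(item["name"] for rec in records for item in rec["inner"])

print(dict(per_document))
print(dict(per_element))
```

The per-element numbers are the ones the question is after; getting them from Elasticsearch itself generally means treating each array element as its own unit, for example with a nested mapping.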

ElasticSearch Max Agg on lowest value inside a list property of the document

I'm looking to do a Max aggregation on a value of a property in my document. The property is a list of complex objects (key and value). Here's my data:
[{
"id" : "1",
"listItems" :
[
{
"key" : "li1",
"value" : 100
},
{
"key" : "li2",
"value" : 5000
}
]
},
{
"id" : "2",
"listItems" :
[
{
"key" : "li3",
"value" : 200
},
{
"key" : "li2",
"value" : 2000
}
]
}]
When I do the nested Max aggregation on "listItems.value", I expect the max value returned to be 200 (and not 5000), because I want the logic to first find the MIN value under listItems for each document and then do the Max aggregation on those minimums. Is it possible to do something like this?
Thanks.
The search query performs the following aggregations:
Terms aggregation on the id field
Min aggregation on listItems.value
Max bucket aggregation, a sibling pipeline aggregation that identifies the bucket(s) with the maximum value of a specified metric in a sibling aggregation and outputs both the value and the key(s) of the bucket(s).
Please refer to the nested aggregation documentation for a detailed explanation of it.
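The three aggregation steps listed above can be mirrored in plain Python to check the expected numbers. This is only a sketch of the arithmetic over the question's two documents, not of the Elasticsearch execution itself.

```python
# The two sample documents from the question.
docs = [
    {"id": "1", "listItems": [{"key": "li1", "value": 100}, {"key": "li2", "value": 5000}]},
    {"id": "2", "listItems": [{"key": "li3", "value": 200}, {"key": "li2", "value": 2000}]},
]

# Terms + min: one bucket per id, holding the minimum listItems.value of that document.
min_per_doc = {d["id"]: min(item["value"] for item in d["listItems"]) for d in docs}

# Max bucket: the largest of those minimums, plus the key(s) of the bucket(s) holding it.
max_value = max(min_per_doc.values())
keys = [doc_id for doc_id, value in min_per_doc.items() if value == max_value]

print(min_per_doc, max_value, keys)
```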
Adding a working example with index data, index mapping, search query, and search result.
Index Mapping:
{
"mappings": {
"properties": {
"listItems": {
"type": "nested"
},
"id":{
"type":"text",
"fielddata":"true"
}
}
}
}
Index Data:
{
"id" : "1",
"listItems" :
[
{
"key" : "li1",
"value" : 100
},
{
"key" : "li2",
"value" : 5000
}
]
}
{
"id" : "2",
"listItems" :
[
{
"key" : "li3",
"value" : 200
},
{
"key" : "li2",
"value" : 2000
}
]
}
Search Query:
{
"size": 0,
"aggs": {
"id_terms": {
"terms": {
"field": "id"
},
"aggs": {
"nested_entries": {
"nested": {
"path": "listItems"
},
"aggs": {
"min_position": {
"min": {
"field": "listItems.value"
}
}
}
}
}
},
"maxValue": {
"max_bucket": {
"buckets_path": "id_terms>nested_entries>min_position"
}
}
}
}
Search Result:
"aggregations": {
"id_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "1",
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 100.0
}
}
},
{
"key": "2",
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 200.0
}
}
}
]
},
"maxValue": {
"value": 200.0,
"keys": [
"2"
]
}
}
The initial post mentioned a nested aggregation, so I assumed the question was about nested documents. Since I arrived at this solution before seeing the other answer, I'm keeping the whole thing for history, but it actually differs only in adding the nested aggregation.
The whole process can be explained like this:
Bucket each document into its own bucket.
Use a nested aggregation to be able to aggregate on the nested documents.
Use a min aggregation to find the minimum value among a document's nested documents, which is also the minimum for the document itself.
Finally, use another aggregation to calculate the maximum among the results of the previous aggregation.
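The four steps above map onto a request body whose buckets_path is simply the chain of aggregation names. Below is a minimal Python sketch that builds such a body; the function name build_max_of_min_request and the aggregation names per_document, nested_items, minimum and max_of_minimums are hypothetical, chosen only for illustration.

```python
import json


def build_max_of_min_request(bucket_field, nested_path, value_field):
    """Build a search body: terms -> nested -> min, plus a sibling max_bucket."""
    return {
        "size": 0,
        "aggs": {
            # Step 1: one bucket per document.
            "per_document": {
                "terms": {"field": bucket_field},
                "aggs": {
                    # Step 2: step into the nested documents.
                    "nested_items": {
                        "nested": {"path": nested_path},
                        "aggs": {
                            # Step 3: minimum value within the document.
                            "minimum": {
                                "min": {"field": f"{nested_path}.{value_field}"}
                            }
                        },
                    }
                },
            },
            # Step 4: maximum over the per-document minimums.
            "max_of_minimums": {
                "max_bucket": {"buckets_path": "per_document>nested_items>minimum"}
            },
        },
    }


body = build_max_of_min_request("_id", "children", "value")
print(json.dumps(body, indent=2))
```

One thing to keep in mind: a terms aggregation returns only the top 10 buckets by default, so on an index with more documents the size parameter would likely need to be raised for the maximum to cover all of them.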
Given this setup:
// PUT /index
{
"mappings": {
"properties": {
"children": {
"type": "nested",
"properties": {
"value": {
"type": "integer"
}
}
}
}
}
}
// POST /index/_doc
{
"children": [
{ "value": 12 },
{ "value": 45 }
]
}
// POST /index/_doc
{
"children": [
{ "value": 7 },
{ "value": 35 }
]
}
I can use those aggregations in request to get required value:
{
"size": 0,
"aggs": {
"document": {
"terms": {"field": "_id"},
"aggs": {
"children": {
"nested": {
"path": "children"
},
"aggs": {
"minimum": {
"min": {
"field": "children.value"
}
}
}
}
}
},
"result": {
"max_bucket": {
"buckets_path": "document>children>minimum"
}
}
}
}
{
"aggregations": {
"document": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "O4QxyHQBK5VO9CW5xJGl",
"doc_count": 1,
"children": {
"doc_count": 2,
"minimum": {
"value": 7.0
}
}
},
{
"key": "OoQxyHQBK5VO9CW5kpEc",
"doc_count": 1,
"children": {
"doc_count": 2,
"minimum": {
"value": 12.0
}
}
}
]
},
"result": {
"value": 12.0,
"keys": [
"OoQxyHQBK5VO9CW5kpEc"
]
}
}
}
There should also be a workaround that uses a script to calculate the max: all the script needs to do is find and return the smallest value in the document.

Combine results of multiple aggregations

I have a movies index in which each document has this structure:
Document :
{
"color": "Color",
"director_name": "Sam Raimi",
"actor_2_name": "James Franco",
"movie_title": "Spider-Man 2",
"actor_3_name" : "Brad Pitt",
"actor_1_name": "J.K. Simmons"
}
I need to calculate the number of movies for each actor (an actor can appear in the actor_1_name, actor_2_name or actor_3_name field).
The mapping of these 3 fields is:
Mapping
"mappings": {
"properties": {
"actor_1_name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"actor_2_name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"actor_3_name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
Is there a way to get an aggregated result that combines the terms from all 3 actor fields into a single aggregation?
Currently I am creating a separate aggregation for each actor field and combining these aggregations into one in my Java code.
Search query that creates the separate aggregations:
Search Query :
{
"aggs" : {
"actor1_count" : {
"terms" : {
"field" : "actor_1_name.keyword"
}
},
"actor2_count" : {
"terms" : {
"field" : "actor_2_name.keyword"
}
},
"actor3_count" : {
"terms" : {
"field" : "actor_3_name.keyword"
}
}
}
}
Result
Sample Result is :
"aggregations": {
"actor1_count": {
"buckets": [
{
"key": "Johnny Depp",
"doc_count": 2
}
]
},
"actor2_count": {
"buckets": [
{
"key": "Johnny Depp",
"doc_count": 1
}
]
},
"actor3_count": {
"buckets": [
{
"key": "Johnny Depp",
"doc_count": 3
}
]
}
}
So, instead of creating separate aggregations, is it possible to combine the results of all 3 aggregations into one aggregation within Elasticsearch?
Basically, this is what I want:
"aggregations": {
"actor_count": {
"buckets": [
{
"key": "Johnny Depp",
"doc_count": 6
}
]
}
}
(Johnny Depp doc_count should show sum from all 3 field actor_1_name, actor_2_name, actor_3_name wherever it is present)
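For reference, the client-side merge described above amounts to summing doc_count per key across the three aggregations. A minimal Python sketch of that arithmetic over the sample result follows; the function name merge_actor_buckets is hypothetical, and this is not the Java code mentioned in the question.

```python
from collections import Counter


def merge_actor_buckets(aggregations):
    """Sum doc_count per actor across every terms aggregation in the response."""
    totals = Counter()
    for agg in aggregations.values():
        for bucket in agg["buckets"]:
            totals[bucket["key"]] += bucket["doc_count"]
    return totals


# The sample result from above.
aggregations = {
    "actor1_count": {"buckets": [{"key": "Johnny Depp", "doc_count": 2}]},
    "actor2_count": {"buckets": [{"key": "Johnny Depp", "doc_count": 1}]},
    "actor3_count": {"buckets": [{"key": "Johnny Depp", "doc_count": 3}]},
}

actor_count = merge_actor_buckets(aggregations)
print(dict(actor_count))
```

Since each terms aggregation returns only its top buckets (10 by default), a merge like this can undercount actors who fall outside the top buckets of one of the fields.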
I have tried a script, but it didn't work correctly.
Script Query :
{
"aggregations": {
"name": {
"terms": {
"script": "doc['actor_1_name.keyword'].value + ' ' + doc['actor_2_name.keyword'].value + ' ' + doc['actor_2_name.keyword'].value"
}
}
}
}
It concatenates the actor names and then buckets on the combined string.
Result :
"buckets": [
{
"key": "Steve Buscemi Adam Sandler Adam Sandler",
"doc_count": 6
},
{
"key": "Leonard Nimoy Nichelle Nichols Nichelle Nichols",
"doc_count": 4
}
]
This is not going to work with terms. You'll have to resort to scripted_metric, I think:
GET actors/_search
{
"size": 0,
"aggs": {
"merged_actors": {
"scripted_metric": {
"init_script": "state.actors_map=[:]",
"map_script": """
def actor_keys = ['actor_1_name', 'actor_2_name', 'actor_3_name'];
for (def key : actor_keys) {
def actor_name = doc[key + '.keyword'].value;
if (state.actors_map.containsKey(actor_name)) {
state.actors_map[actor_name] += 1;
} else {
state.actors_map[actor_name] = 1;
}
}
""",
"combine_script": "return state",
"reduce_script": "return states"
}
}
}
}
yielding
...
"aggregations" : {
"merged_actors" : {
"value" : [
{
"actors_map" : {
"Brad Pitt" : 5,
"J.K. Simmons" : 1,
"James Franco" : 3
}
}
]
}
}
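Because the reduce_script returns states unchanged, the value is a list with one entry per shard, so on an index with several shards the per-shard maps still have to be summed. A minimal Python sketch of that merge is shown here; the function name merge_shard_states is hypothetical, and the second entry in the sample list is invented purely to illustrate a second shard.

```python
from collections import Counter


def merge_shard_states(states):
    """Sum the per-shard actors_map dictionaries into one actor -> count mapping."""
    totals = Counter()
    for state in states:
        # Counter.update adds the counts of a mapping to the running totals.
        totals.update(state["actors_map"])
    return dict(totals)


states = [
    # Mirrors the single-shard response above.
    {"actors_map": {"Brad Pitt": 5, "J.K. Simmons": 1, "James Franco": 3}},
    # Hypothetical second shard, not part of the original response.
    {"actors_map": {"Brad Pitt": 2, "Johnny Depp": 4}},
]

merged = merge_shard_states(states)
print(merged)
```

The same summing could presumably be moved into the reduce_script itself, at the cost of changing the shape of the response shown above.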

Counting non-unique items in an Elasticsearch aggregation?

I'm trying to use an Elasticsearch aggregation to return all non-unique counts for each term within a bucket.
Given a mapping:-
{
"properties": {
"addresses": {
"properties": {
"meta": {
"properties": {
"types": {
"properties": {
"type": {
"type": "keyword"
}
}
}
}
}
}
}
}
}
And a document:-
{
"id": 3,
"first_name": "James",
"last_name": "Smith",
"addresses": [
{
"meta": {
"types": [
{
"type": "Home"
},
{
"type": "Home"
},
{
"type": "Business"
},
{
"type": "Business"
},
{
"type": "Business"
},
{
"type": "Fax"
}
]
}
}
]
}
The following terms aggregation:-
GET /test/_search
{
"size": 0,
"query": {
"match": {
"id": 3
}
},
"aggs": {
"types": {
"terms": {
"field": "addresses.meta.types.type"
}
}
}
}
Gives this result:-
"aggregations" : {
"types" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Business",
"doc_count" : 1
},
{
"key" : "Fax",
"doc_count" : 1
},
{
"key" : "Home",
"doc_count" : 1
}
]
}
}
As you can see the terms are unique and I'm really after a total count of each e.g. Home: 2, Business: 3 and Fax: 1.
Is this possible?
I had a look at value_count, but as it's not a bucket aggregation it seems a little less convenient to use. Alternatively, a script might do it, but I'm not too sure about the syntax.
Thanks!
I doubt that this is possible with the object type in Elasticsearch. The reason is that bucket counts are based on the number of documents in which a value occurs, not on the number of occurrences of the value within a document.
You may have to change the type of your types field to nested so that ES stores each entry inside types as a separate document.
I've provided a sample mapping, document (no change in representation), aggregation query and response below.
Sample Mapping:
PUT nested_test
{
"mappings":{
"properties":{
"id":{
"type":"integer"
},
"first_name":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword"
}
}
},
"last_name":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword"
}
}
},
"addresses":{
"properties":{
"meta":{
"properties":{
"types":{
"type":"nested", <----- Note this
"properties":{
"type":{
"type":"keyword"
}
}
}
}
}
}
}
}
}
}
Sample Document (No change)
POST nested_test/_doc/1
{
"id": 3,
"first_name": "James",
"last_name": "Smith",
"addresses": [
{
"meta": {
"types": [
{
"type": "Home"
},
{
"type": "Home"
},
{
"type": "Business"
},
{
"type": "Business"
},
{
"type": "Business"
},
{
"type": "Fax"
}
]
}
}
]
}
Note that every type above is now considered as a separate document linked to the main document.
Aggregation Query:
All that would be required is to make use of Nested Aggregation + Terms Aggregation
POST nested_test/_search
{
"size": 0,
"aggs": {
"myterms": {
"nested": {
"path": "addresses.meta.types"
},
"aggs": {
"myterms": {
"terms": {
"field": "addresses.meta.types.type",
"size": 10,
"min_doc_count": 2 <----- Note this to filter only values with non unique counts
}
}
}
}
}
}
Note that in the above query I've made use of min_doc_count in order to restrict the results as per what you are looking for.
Response:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"myterms" : {
"doc_count" : 6,
"myterms" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Business",
"doc_count" : 3
},
{
"key" : "Home",
"doc_count" : 2
}
]
}
}
}
}
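The counts in this response can be cross-checked with a few lines of Python that treat every entry of types as its own unit, which is roughly what the nested mapping achieves. This is only a sketch of the arithmetic over the sample document; the threshold of 2 mirrors min_doc_count in the query.

```python
from collections import Counter

# The relevant part of the sample document.
document = {
    "addresses": [
        {
            "meta": {
                "types": [
                    {"type": "Home"},
                    {"type": "Home"},
                    {"type": "Business"},
                    {"type": "Business"},
                    {"type": "Business"},
                    {"type": "Fax"},
                ]
            }
        }
    ]
}

# Count every entry of types separately, repeats included.
type_counts = Counter(
    entry["type"]
    for address in document["addresses"]
    for entry in address["meta"]["types"]
)

# Keep only the values that occur at least twice (min_doc_count: 2).
non_unique = {key: count for key, count in type_counts.items() if count >= 2}

print(dict(type_counts))
print(non_unique)
```

Dropping the threshold (or setting min_doc_count to 1 in the query) would also return Fax with a count of 1.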
Hope that helps!

How to use value of nested documents in script scoring

Schema looks like this:
"mappings": {
"_doc": {
"_all": {
"enabled": false
},
"properties": {
"category_boost": {
"type": "nested",
"properties" : {
"category": {
"type": "text",
"index": false
},
"boost": {
"type": "integer",
"index": false
}
}
}
}
}
}
The document in Elasticsearch does contain data:
"category_boost": [
{
"category": "A",
"boost": 98
},
{
"category": "B",
"boost": 96
},
{
"category": "C",
"boost": 94
},
],
Inside scoring function:
for (int i=0; i<doc['"'category_boost.boost'"'].size(); ++i) {
if (doc['"'category_boost.category'"'][i].value.equals(params.category)) {
boost = doc['"'category_boost.boost'"'][i].value;
}
}
I also tried length to get the size of the array, but it did not help. Since the loop does not affect the results, I tried dividing by size() and it throws a division-by-zero error, so I conclude the size is 0.
Overall problem: I have a category->boost map that is dynamic, so I cannot hardcode it into the schema. I tried the object type with a JSON object, but it turned out that those objects cannot be accessed in scoring functions, so I went with arrays with defined types.
The nested datatype creates sub-documents to represent the items of your collection. Accessing their doc values in a script is possible, but you need to be inside a nested query.
Here is one way of doing it; I hope it fulfills your requirements. This example only returns the document with a score that depends on the chosen category.
NB: I used Elasticsearch 7 locally, so you will have to modify the mapping to add your "_doc" entry, etc.
Here is the modified mapping. I removed index: false from the nested properties since we now use them in queries.
PUT test-score_nested
{
"mappings": {
"properties": {
"category_boost": {
"type": "nested",
"properties": {
"category": {
"type": "keyword"
},
"boost": {
"type": "integer"
}
}
}
}
}
}
Then I add your sample data:
POST test-score_nested/_doc
{
"category_boost": [
{
"category": "A",
"boost": 98
},
{
"category": "B",
"boost": 96
},
{
"category": "C",
"boost": 94
}
]
}
And then the query.
We go one level deep into the nested collection
Inside the collection we use a function score query with the replace boost mode
Inside the function score, we use a term query to select the wanted category and use its boost for the scoring
POST test-score_nested/_search
{
"query": {
"nested": {
"path": "category_boost",
"query": {
"function_score": {
"boost_mode": "replace",
"query": {
"term": {
"category_boost.category": {
"value": "A"
}
}
},
"functions": [
{
"field_value_factor": {
"field": "category_boost.boost"
}
}
]
}
}
}
}
}
returns
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 98.0,
"hits" : [
{
"_index" : "test-score_nested",
"_type" : "_doc",
"_id" : "v3Smqm0BZ7nyeX7PPevA",
"_score" : 98.0,
"_source" : {
"category_boost" : [
{
"category" : "A",
"boost" : 98
},
{
"category" : "B",
"boost" : 96
},
{
"category" : "C",
"boost" : 94
}
]
}
}
]
}
}
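The scoring logic of this query can be approximated in a few lines of Python: pick the nested entry whose category matches and use its boost as the score. This is only a sketch of the intent; the function name category_score is hypothetical, and the averaging step assumes the nested query's default score mode (avg) when several entries match.

```python
def category_score(document, category):
    """Return the boost of the nested entries matching `category`, or None if none match."""
    boosts = [
        entry["boost"]
        for entry in document["category_boost"]
        if entry["category"] == category
    ]
    if not boosts:
        # No nested entry matches, so the document would not be returned.
        return None
    # Average over matching entries; with a single match this is just its boost.
    return sum(boosts) / len(boosts)


# The sample document from above.
document = {
    "category_boost": [
        {"category": "A", "boost": 98},
        {"category": "B", "boost": 96},
        {"category": "C", "boost": 94},
    ]
}

print(category_score(document, "A"))
print(category_score(document, "B"))
print(category_score(document, "Z"))
```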
I hope it will help you!
