Group by terms and get count of nested array property? - elasticsearch

I would like to get the count from a document series where an array item matches some value.
I have documents like these:
{
"Name": "jason",
"Todos": [{
"State": "COMPLETED",
"Timer": 10
},{
"State": "PENDING",
"Timer": 5
}]
}
{
"Name": "jason",
"Todos": [{
"State": "COMPLETED",
"Timer": 5
},{
"State": "PENDING",
"Timer": 2
}]
}
{
"Name": "martin",
"Todos": [{
"State": "COMPLETED",
"Timer": 15
},{
"State": "PENDING",
"Timer": 10
}]
}
I would like to count the documents that have at least one Todos item with State COMPLETED, grouped by Name.
So from the above I would need to get:
jason: 2
martin: 1
Usually I do this with a terms aggregation on Name and another sub-aggregation for the other items:
"aggs": {
"statistics": {
"terms": {
"field": "Name"
},
"aggs": {
"test": {
"filter": {
"bool": {
"must": [{
"match_phrase": {
"SomeProperty.keyword": {
"query": "THEVALUE"
}
}
}
]
}
},
But not sure how to do this here as I have items in an array.

Elasticsearch has no problem with arrays because in fact it flattens them by default:
Arrays of inner object fields do not work the way you may expect. Lucene has no concept of inner objects, so Elasticsearch flattens object hierarchies into a simple list of field names and values.
So a query like the one you posted will work. I would use a term query for the keyword datatype, though:
POST mytodos/_search
{
"size": 0,
"aggs": {
"by name": {
"terms": {
"field": "Name"
},
"aggs": {
"how many completed": {
"filter": {
"term": {
"Todos.State": "COMPLETED"
}
}
}
}
}
}
}
I am assuming your mapping looks something like this:
PUT mytodos/_mappings
{
"properties": {
"Name": {
"type": "keyword"
},
"Todos": {
"properties": {
"State": {
"type": "keyword"
},
"Timer": {
"type": "integer"
}
}
}
}
}
The example documents that you posted will be transformed internally into something like this:
{
"Name": "jason",
"Todos.State": ["COMPLETED", "PENDING"],
"Todos.Timer": [10, 5]
}
However, if you need to query Todos.State and Todos.Timer together, for example to filter for items that are "COMPLETED" and also have Timer > 10, that is not possible with this mapping, because Elasticsearch loses the link between the fields of each object in the array.
In that case you would need to map such arrays with the nested datatype and query them with the nested query.
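To make the difference concrete, here is a small plain-Python sketch (no Elasticsearch client involved) that mimics both behaviours on the sample documents. The helper names `flatten`, `completed_per_name`, `flattened_match` and `nested_match` are made up for illustration; the real matching happens inside Lucene, so treat this as a model of the semantics rather than of the implementation.

```python
from collections import Counter

docs = [
    {"Name": "jason", "Todos": [{"State": "COMPLETED", "Timer": 10},
                                {"State": "PENDING", "Timer": 5}]},
    {"Name": "jason", "Todos": [{"State": "COMPLETED", "Timer": 5},
                                {"State": "PENDING", "Timer": 2}]},
    {"Name": "martin", "Todos": [{"State": "COMPLETED", "Timer": 15},
                                 {"State": "PENDING", "Timer": 10}]},
]


def flatten(doc):
    """Mimic the default object mapping: one multi-valued field per leaf."""
    return {
        "Name": doc["Name"],
        "Todos.State": [t["State"] for t in doc["Todos"]],
        "Todos.Timer": [t["Timer"] for t in doc["Todos"]],
    }


def completed_per_name(docs):
    """terms on Name plus a filter on Todos.State, evaluated on flattened docs."""
    counts = Counter()
    for flat in map(flatten, docs):
        if "COMPLETED" in flat["Todos.State"]:
            counts[flat["Name"]] += 1
    return dict(counts)


def flattened_match(doc, state, min_timer):
    """Each condition is checked against the whole array, independently."""
    flat = flatten(doc)
    return state in flat["Todos.State"] and any(t > min_timer for t in flat["Todos.Timer"])


def nested_match(doc, state, min_timer):
    """Both conditions must hold for the same array item."""
    return any(t["State"] == state and t["Timer"] > min_timer for t in doc["Todos"])


print(completed_per_name(docs))  # {'jason': 2, 'martin': 1}
# PENDING with Timer > 5: the first jason document is a false positive when flattened.
print([d["Name"] for d in docs if flattened_match(d, "PENDING", 5)])  # ['jason', 'martin']
print([d["Name"] for d in docs if nested_match(d, "PENDING", 5)])     # ['martin']
```

The single-field count comes out right on flattened documents, which is why the filter aggregation is enough for the original question; only the two-field condition needs per-item matching.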
Hope that helps!

Related

Bucket sort in composite aggregation?

How can I do Bucket Sort in composite Aggregation?
I need to do Composite Aggregation with Bucket sort.
I have tried Sort with aggregation.
I have tried composite aggregation.
I think this question is a continuation of your previous question, so I have assumed the same use case.
You need to use the bucket sort aggregation, which is a parent pipeline aggregation that sorts the buckets of its parent multi-bucket aggregation. Please refer to this documentation on composite aggregation to learn more.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"mappings":{
"properties":{
"user":{
"type":"keyword"
},
"date":{
"type":"date"
}
}
}
}
Index Data:
{
"date": "2015-01-01",
"user": "user1"
}
{
"date": "2014-01-01",
"user": "user2"
}
{
"date": "2015-01-11",
"user": "user3"
}
Search Query:
The size parameter can be set to define how many composite buckets
should be returned. Each composite bucket is considered as a single
bucket, so setting a size of 10 will return the first 10 composite
buckets created from the values source. The response contains the
values for each composite bucket in an array containing the values
extracted from each value source. Defaults to 10.
{
"size": 0,
"aggs": {
"my_buckets": {
"composite": {
"size": 3, // note this
"sources": [
{
"product": {
"terms": {
"field": "user"
}
}
}
]
},
"aggs": {
"mySort": {
"bucket_sort": {
"sort": [
{
"sort_user": {
"order": "desc"
}
}
]
}
},
"sort_user": {
"min": {
"field": "date"
}
}
}
}
}
}
Search Result:
"aggregations": {
"my_buckets": {
"after_key": {
"product": "user3"
},
"buckets": [
{
"key": {
"product": "user3"
},
"doc_count": 1,
"sort_user": {
"value": 1.4209344E12,
"value_as_string": "2015-01-11T00:00:00.000Z"
}
},
{
"key": {
"product": "user1"
},
"doc_count": 1,
"sort_user": {
"value": 1.4200704E12,
"value_as_string": "2015-01-01T00:00:00.000Z"
}
},
{
"key": {
"product": "user2"
},
"doc_count": 1,
"sort_user": {
"value": 1.3885344E12,
"value_as_string": "2014-01-01T00:00:00.000Z"
}
}
]
}
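The ordering in this result can be reproduced by hand. The following plain-Python sketch assumes that the composite aggregation produces one bucket per user in ascending key order, `size` buckets per page, and that bucket_sort then reorders only the buckets of the current page. `composite_page` and `bucket_sort` are hypothetical helper names, not Elasticsearch client calls.

```python
from itertools import groupby

docs = [
    {"date": "2015-01-01", "user": "user1"},
    {"date": "2014-01-01", "user": "user2"},
    {"date": "2015-01-11", "user": "user3"},
]


def composite_page(docs, size, after=None):
    """One bucket per user in ascending key order, at most `size` per page."""
    ordered = sorted(docs, key=lambda d: d["user"])
    buckets = []
    for user, group in groupby(ordered, key=lambda d: d["user"]):
        if after is not None and user <= after:
            continue  # skip keys already returned on earlier pages
        group = list(group)
        buckets.append({
            "key": user,
            "doc_count": len(group),
            # min sub-aggregation on date; ISO dates compare correctly as strings
            "sort_user": min(d["date"] for d in group),
        })
    return buckets[:size]


def bucket_sort(buckets, field, descending=True):
    """Reorder the buckets of one page by a sub-aggregation value."""
    return sorted(buckets, key=lambda b: b[field], reverse=descending)


page = bucket_sort(composite_page(docs, size=3), "sort_user")
print([b["key"] for b in page])  # ['user3', 'user1', 'user2']

# With a smaller page, only the buckets on that page are sorted.
small = bucket_sort(composite_page(docs, size=2), "sort_user")
print([b["key"] for b in small])  # ['user1', 'user2']
```

Under these assumptions the sort is only global when the composite `size` covers every bucket, which is what the `"size": 3` in the query achieves for this data set.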

How to get the average count of missing field per document with Elasticsearch?

In short: with Elasticsearch, given a list of fields, how can I get the average number of missing fields per document as an aggregation?
Details
With the missing aggregation type I can get the total number of documents where a given field is missing. So with the following data:
"hits": [{
"name": "A name",
"nickname": "A nickname",
"bestfriend": "A friend",
"hobby": "An hobby"
},{
"name": "A name",
"hobby": "An hobby"
},{
"name": "A name",
"nickname": "A nickname",
"hobby": "An hobby"
},{
"name": "A name",
"bestfriend": "A friend"
}]
I can run the following query:
{
"aggs": {
"name_missing": {
"missing": {"field": "name"}
},
"nickname_missing": {
"missing": {"field": "nickname"}
},
"hobby_missing": {
"missing": {"field": "hobby"}
},
"bestfriend_missing": {
"missing": {"field": "bestfriend"}
}
}
}
And I get the following aggregations:
...
"aggregations": {
"name_missing": {
"doc_count": 0
},
"nickname_missing": {
"doc_count": 2
},
"hobby_missing": {
"doc_count": 1
},
"bestfriend_missing": {
"doc_count": 2
}
}
...
What I need now is the average number of missing fields per document. I can do the math in code on the results:
sum the doc_count values of all the missing aggregations
divide by the total number of hits
But how would you get the same result as an aggregation from Elasticsearch?
Thank you for any reply / suggestion.
This is an ugly solution but it does the trick.
GET missing/missing/_search
{
"size": 0,
"aggs": {
"result": {
"terms": {
"script": "'aaa'"
},
"aggs": {
"name_missing": {
"missing": {
"field": "name"
}
},
"nickname_missing": {
"missing": {
"field": "nickname"
}
},
"hobby_missing": {
"missing": {
"field": "hobby"
}
},
"bestfriend_missing": {
"missing": {
"field": "bestfriend"
}
},
"avg_missing": {
"bucket_script": {
"buckets_path": { // Defines the script variables: name_missing._count is the doc_count of the name_missing aggregation (likewise for nickname_missing, hobby_missing and bestfriend_missing); "count": "_count" is the doc_count of the enclosing bucket, i.e. the total number of hits.
"name_missing": "name_missing._count",
"nickname_missing": "nickname_missing._count",
"hobby_missing": "hobby_missing._count",
"bestfriend_missing": "bestfriend_missing._count",
"count":"_count"
},
"script": "(name_missing+nickname_missing+hobby_missing+bestfriend_missing)/count" // Sum all the missing counts and divide by the total number of hits.
}
}
}
}
}
}
I've shown you how to do it; now it's up to you to adjust the parameters and extract what you need.
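For comparison, the same arithmetic can be checked outside Elasticsearch. This is a minimal plain-Python sketch over the four sample documents from the question; `missing_counts` and `average_missing` are illustrative names, and a field counts as missing when the key is absent or its value is None.

```python
docs = [
    {"name": "A name", "nickname": "A nickname", "bestfriend": "A friend", "hobby": "An hobby"},
    {"name": "A name", "hobby": "An hobby"},
    {"name": "A name", "nickname": "A nickname", "hobby": "An hobby"},
    {"name": "A name", "bestfriend": "A friend"},
]
FIELDS = ["name", "nickname", "hobby", "bestfriend"]


def missing_counts(docs, fields):
    """Per field, the number of documents without a value (like the missing aggregation)."""
    return {f: sum(1 for d in docs if d.get(f) is None) for f in fields}


def average_missing(docs, fields):
    """Sum of all missing counts divided by the number of documents."""
    counts = missing_counts(docs, fields)
    return sum(counts.values()) / len(docs)


print(average_missing(docs, FIELDS))  # 1.25
```

The division mirrors the bucket_script above: the numerator is the sum of the per-field counts and the denominator is the document count of the single enclosing bucket.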

Item variants in ElasticSearch

What is the best way to model item variants in Elasticsearch and retrieve only one item per variant group?
For example, let's say I have the following items:
[{
"sku": "abc-123",
"group": "abc",
"color": "red",
"price": 10
},
{
"sku": "def-123",
"group": "def",
"color": "red",
"price": 10
},
{
"sku": "abc-456",
"group": "abc",
"color": "black",
"price": 20
}
]
The first item and the last one are in the same group, so I want only to return one of them if I query for items below the price of 20 (for example), but with the best hit score.
Feel free to suggest documents design and queries accordingly.
If your mapping is of Nested datatype, then you can use this to retrieve them.
GET index/type/_search
{
"size": 2000,
"_source": false,
"query": {
"bool": {
"filter": {
"nested": {
"path": "childs",
"query": {
"bool": {
"filter": {
"term": {
"childs.group.keyword": "abc"
}
}
}
},
"inner_hits": {}
}
}
}
}
}

Elasticsearch: Aggregate distinct values in array

I am using Elasticsearch to store click traffic and each row includes topics of the page which has been visited. A typical row looks like:
{
"date": "2017-09-10T12:26:53.998Z",
"pageid": "10263779",
"loc_ll": [
-73.6487,
45.4671
],
"ua_type": "Computer",
"topics": [
"Trains",
"Planes",
"Electric Cars"
]
}
I want each topic to be a keyword, so that a search for cars returns nothing and only Electric Cars returns a result.
I also want to run a distinct query on all topics in all rows so I have a list of all topics used.
Doing this on pageid would look like the following, but I am unsure how to approach this for the topics array.
{
"aggs": {
"ids": {
"terms": {
"field": "pageid",
"size": 10
}
}
}
}
Your approach to querying and getting the available terms looks fine. You should probably check your mapping: if you get results for cars, it looks as if topics is mapped as an analyzed string (e.g. type text instead of keyword).
PUT keywordarray
{
"mappings": {
"item": {
"properties": {
"id": {
"type": "integer"
},
"topics": {
"type": "keyword"
}
}
}
}
}
With this sample data
POST keywordarray/item
{
"id": 123,
"topics": [
"first topic", "second topic", "another"
]
}
and this aggregation:
GET keywordarray/item/_search
{
"size": 0,
"aggs": {
"topics": {
"terms": {
"field": "topics"
}
}
}
}
will result in this:
"aggregations": {
"topics": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "another",
"doc_count": 1
},
{
"key": "first topic",
"doc_count": 1
},
{
"key": "second topic",
"doc_count": 1
}
]
}
}
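The way a terms aggregation treats an array of keywords can be modelled in a few lines of plain Python. This sketch assumes each array element is indexed as its own exact term and that a document is counted once per distinct value; `terms_agg` is a hypothetical helper, and the second sample document is invented to show that a partial word such as "Cars" never becomes a bucket.

```python
from collections import Counter

docs = [
    {"id": 123, "topics": ["first topic", "second topic", "another"]},
    {"id": 124, "topics": ["Electric Cars", "another"]},  # invented sample document
]


def terms_agg(docs, field, size=10):
    """Count documents per exact value of a (possibly multi-valued) keyword field."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field, [])
        values = value if isinstance(value, list) else [value]
        for v in set(values):  # a document counts once per distinct value
            counts[v] += 1
    # doc_count descending, then key ascending for ties
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": k, "doc_count": c} for k, c in ordered[:size]]


buckets = terms_agg(docs, "topics")
print([b["key"] for b in buckets])
# ['another', 'Electric Cars', 'first topic', 'second topic']
```

Each bucket key is a whole array element, so "Electric Cars" stays a single term and the list of keys is the distinct set of topics.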
It is very therapeutic asking on SO. Simply changing the mapping type to keyword allowed me to achieve what I needed.
A part of me thought that it would concatenate the array into a string, but it doesn't.
{
"mappings": {
"view": {
"properties": {
"topics": {
"type": "keyword"
},...
}
}
}
}
and a search query like
{
"aggs": {
"ids": {
"terms": {
"field": "topics",
"size": 10
}
}
}
}
will return a distinct list of all elements in the field's array.

How can I retrieve a document with a defined amount of sorted nested fields?

Here is my problem: let's assume I have a Facebook post indexed on ElasticSearch. This post has many comments as nested fields, which, themselves, have a "likes" count. So, the mapping would be something like this:
"mappings": {
"post": {
"properties": {
"id": {
"type": "integer"
},
"comments": {
"type": "nested",
"properties": {
"like_count": {
"type": "integer"
}
}
}
}
}
}
A post could have thousands of comments, but what if I want to retrieve only the 10 most liked comments from a certain post (so I'd have to define the post's id, limit a size for the field array and define a sort rule)? Is it possible? I've tried many ways using the "nested" query, but with no success.
Any ideas?
Edit: one of the queries I tried, in case anyone still has a doubt about what I want:
{
"query": {
"match": {
"id": 81500
}
},
"sort": [
{
"comments.like_count":
{
"order": "desc"
}
}
],
"nested": {
"path": "comments",
"inner_hits": {
"size": 10
}
}
}
Try this query:
{
"query": {
"match": {
"id": 81500
}
},
"sort": [
{
"comments.like_count":
{
"order": "desc"
}
}
],
"size": 10
}
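Note that a top-level "size" and "sort" apply to the matching posts, not to the comments inside one post. One possible alternative, sketched here as a Python dict so it can be inspected and serialised, is a nested query whose inner_hits carry their own "size" and "sort". This assumes an Elasticsearch version whose inner_hits accept those options and that the comments field is mapped as nested, as in the question; `top_comments_body` is a hypothetical helper name and the body has not been run against a cluster.

```python
import json


def top_comments_body(post_id, size=10):
    """Build a search body that returns the most liked comments of one post."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # restrict the search to a single post
                    {"term": {"id": post_id}},
                    {
                        "nested": {
                            "path": "comments",
                            "query": {"match_all": {}},
                            # inner hits are the nested comments themselves
                            "inner_hits": {
                                "size": size,
                                "sort": [{"comments.like_count": {"order": "desc"}}],
                            },
                        }
                    },
                ]
            }
        }
    }


body = top_comments_body(81500)
print(json.dumps(body, indent=2))
```

With a body of this shape, the comments would be read from the inner_hits section of each hit in the response rather than from the post's _source.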

Resources