How to get the average count of missing field per document with Elasticsearch? - elasticsearch

Shortly: with Elasticsearch, given a list of fields, how can I get the average number of missing fields per document as an aggregation?
Details
With the missing aggregation type I can get the total number of documents where a given field is missing. So with the following data:
"hits": [{
"name": "A name",
"nickname": "A nickname",
"bestfriend": "A friend",
"hobby": "An hobby"
},{
"name": "A name",
"hobby": "An hobby"
},{
"name": "A name",
"nickname": "A nickname",
"hobby": "An hobby"
},{
"name": "A name",
"bestfriend": "A friend"
}]
I can run the following query:
{
"aggs": {
"name_missing": {
"missing": {"field": "name"}
},
"nickname_missing": {
"missing": {"field": "nickname"}
},
"hobby_missing": {
"missing": {"field": "hobby"}
},
"bestfriend_missing": {
"missing": {"field": "bestfriend"}
}
}
}
And I get the following aggregations:
...
"aggregations": {
"name_missing": {
"doc_count": 0
},
"nickname_missing": {
"doc_count": 2
},
"hobby_missing": {
"doc_count": 1
},
"bestfriend_missing": {
"doc_count": 1
}
}
...
What I need now is to get the average number of missing fields for each document. I can just do the math by code on the results:
sum all the missing aggregations doc_count value
divide by the total number of hits
But how would you get the same result as an aggregation from Elasticsearch?
Thank you for any reply / suggestion.

This is an ugly solution but it does the trick.
GET missing/missing/_search
{
"size": 0,
"aggs": {
"result": {
"terms": {
"script": "'aaa'"
},
"aggs": {
"name_missing": {
"missing": {
"field": "name"
}
},
"nickname_missing": {
"missing": {
"field": "nickname"
}
},
"hobby_missing": {
"missing": {
"field": "hobby"
}
},
"bestfriend_missing": {
"missing": {
"field": "bestfriend"
}
},
"avg_missing": {
"bucket_script": {
"buckets_path": { // This is kind of defining variables. name_missing._count will take the doc_count of the name_missing aggregation and same for others(nickname_missing,hobby_missing,bestfriend_missing) as well. "count":"_count" will take doc_count of the documents on which aggregation is performed(total no. of Hits).
"name_missing": "name_missing._count",
"nickname_missing": "nickname_missing._count",
"hobby_missing": "hobby_missing._count",
"bestfriend_missing": "bestfriend_missing._count",
"count":"_count"
},
"script": "(name_missing+nickname_missing+hobby_missing+bestfriend_missing)/count" // Here we are adding all the missing values and dividing it by the total no. of Hits as you require.
}
}
}
}
}
}
I've shown you how to do it, now its on you how you want to massage your parameters and extract what you intend to.

Related

Group by terms and get count of nested array property?

I would like to get the count from a document series where an array item matches some value.
I have documents like these:
{
"Name": "jason",
"Todos": [{
"State": "COMPLETED"
"Timer": 10
},{
"State": "PENDING"
"Timer": 5
}]
}
{
"Name": "jason",
"Todos": [{
"State": "COMPLETED"
"Timer": 5
},{
"State": "PENDING"
"Timer": 2
}]
}
{
"Name": "martin",
"Todos": [{
"State": "COMPLETED"
"Timer": 15
},{
"State": "PENDING"
"Timer": 10
}]
}
I would like to count how many documents I have where they have any Todos with COMPLETED State. And group by Name.
So from the above I would need to get:
jason: 2
martin: 1
Usually I do this with a term aggregation for the Name, and an other sub aggregation for other items:
"aggs": {
"statistics": {
"terms": {
"field": "Name"
},
"aggs": {
"test": {
"filter": {
"bool": {
"must": [{
"match_phrase": {
"SomeProperty.keyword": {
"query": "THEVALUE"
}
}
}
]
}
},
But not sure how to do this here as I have items in an array.
Elasticsearch has no problem with arrays because in fact it flattens them by default:
Arrays of inner object fields do not work the way you may expect. Lucene has no concept of inner objects, so Elasticsearch flattens object hierarchies into a simple list of field names and values.
So a query like the one you posted will do. I would use term query for keyword datatype, though:
POST mytodos/_search
{
"size": 0,
"aggs": {
"by name": {
"terms": {
"field": "Name"
},
"aggs": {
"how many completed": {
"filter": {
"term": {
"Todos.State": "COMPLETED"
}
}
}
}
}
}
}
I am assuming your mapping looks something like this:
PUT mytodos/_mappings
{
"properties": {
"Name": {
"type": "keyword"
},
"Todos": {
"properties": {
"State": {
"type": "keyword"
},
"Timer": {
"type": "integer"
}
}
}
}
}
The example documents that you posted will be transformed internally into something like this:
{
"Name": "jason",
"Todos.State": ["COMPLETED", "PENDING"],
"Todos.Timer": [10, 5]
}
However, if you need to query for Todos.State and Todos.Timer, for example, filter for those "COMPLETED" but only with Timer > 10, it will not be possible with such mapping because Elasticsearch forgets the link between fields of object array items.
In this case you would need to use something like nested datatype for such arrays, and query them with special nested query.
Hope that helps!

Multiple key aggregation in ElasticSearch

I am new to Elastic Search and was exploring aggregation query. The documents I have are in the format -
{"name":"A",
"class":"10th",
"subjects":{
"S1":92,
"S2":92,
"S3":92,
}
}
We have about 40k such documents in our ES with the Subjects varying from student to student. The query to the system can be to aggregate all subject-wise scores for a given class. We tried to create a bucket aggregation query as explained in this guide here, however, this generates a single bucket per document and in our understanding requires an explicit mention of every subject.
We want to system to generate subject wise aggregate for the data by executing a single aggregation query, the problem I face is that in our data the subjects could vary from student to student and we don't have a global list of subject keys.
We wrote the following script but this only works if we know all possible subjects.
GET student_data_v1_1/_search
{ "query" :
{"match" :
{ "class" : "' + query + '" }},
"aggs" : { "my_buckets" : { "terms" :
{ "field" : "subjects", "size":10000 },
"aggregations": {"the_avg":
{"avg": { "field": "subjects.value" }}} }},
"size" : 0 }'
but this query only works for the document structure, but does not work multiple subjects are defined where we may not know the key-pair -
{"name":"A",
"class":"10th",
"subjects":{
"value":93
}
}
An alternate form the document is present is that the subject is a list of dictionaries -
{"name":"A",
"class":"10th",
"subjects":[
{"S1":92},
{"S2":92},
{"S3":92},
]
}
Having an aggregation query to solve either of the 2 document formats would be helpful.
======EDITS======
After updating the document to hold weights for each subject -
{
class": "10th",
"subject": [
{
"name": "s1",
"marks": 90,
"weight":30
},
{
"name": "s2",
"marks": 80,
"weight":70
}
]}
I have updated the query to be -
{
"query": {
"match": {
"class": "10th"
}
},
"aggs": {
"subjects": {
"nested": {
"path": "scores"
},
"aggs": {
"subjects": {
"terms": {
"field": "subject.name"
},
"aggs" : { "weighted_grade": { "weighted_avg": { "value": { "field": "subjects.score" }, "weight": { "field": "subjects.weight" } } } }
}
}
}
}
},
"size": 0
}
but it throws the error-
{u'error': {u'col': 312,
u'line': 1,
u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]',
u'root_cause': [{u'col': 312,
u'line': 1,
u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]',
u'type': u'unknown_named_object_exception'}],
u'type': u'unknown_named_object_exception'},
u'status': 400}
To achieve the required result I would suggest you to keep your index mapping as follows:
{
"properties": {
"class": {
"type": "keyword"
},
"subject": {
"type": "nested",
"properties": {
"marks": {
"type": "integer"
},
"name": {
"type": "keyword"
}
}
}
}
}
In the mapping above I have created subject as nested type with two properties, name to hold subject name and marks to hold marks in the subject.
Sample doc:
{
"class": "10th",
"subject": [
{
"name": "s1",
"marks": 90
},
{
"name": "s2",
"marks": 80
}
]
}
Now you can use nested aggregation and multilevel aggregation (i.e. aggregation inside aggregation). I used nested aggregation with terms aggregation for subject.name to get bucket containing all the available subjects. Then to get avg for each subject we add a child aggregation of avg to the subjects aggregation as below:
{
"query": {
"match": {
"class": "10th"
}
},
"aggs": {
"subjects": {
"nested": {
"path": "subject"
},
"aggs": {
"subjects": {
"terms": {
"field": "subject.name"
},
"aggs": {
"avg_score": {
"avg": {
"field": "subject.marks"
}
}
}
}
}
}
},
"size": 0
}
NOTE: I have added "size" : 0 so that elastic doesn't return matching docs in the result. To include or exclude it depends totally on your use case.
Sample result:
{
"took": 25,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": [
]
},
"aggregations": {
"subjects": {
"doc_count": 6,
"subjects": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "s1",
"doc_count": 3,
"avg_score": {
"value": 80
}
},
{
"key": "s2",
"doc_count": 2,
"avg_score": {
"value": 75
}
},
{
"key": "s3",
"doc_count": 1,
"avg_score": {
"value": 80
}
}
]
}
}
}
}
As you can see the result contains buckets with key as subject name and avg_score.value as the avg of marks.
UPDATE to include weighted_avg:
{
"query": {
"match": {
"class": "10th"
}
},
"aggs": {
"subjects": {
"nested": {
"path": "subject"
},
"aggs": {
"subjects": {
"terms": {
"field": "subject.name"
},
"aggs": {
"avg_score": {
"avg": {
"field": "subject.marks"
}
},
"weighted_grade": {
"weighted_avg": {
"value": {
"field": "subject.marks"
},
"weight": {
"field": "subject.weight"
}
}
}
}
}
}
}
},
"size": 0
}

Filter elasticsearch bucket aggregation based on term field

I have a list of products (deal entities) and I'm attempting to create a bucket aggregation by categories, ordered by the sum of available_stock.
This all works fine, but I want to exclude such categories from the resulting aggregation that don't have level set to 1 (In other words, I only want to keep aggregations on category where level IS 1).
I am aware that elasticsearch provides "exclude" and "include" parameters, but these only work on the same field I'm aggregating on (deal.category.id in this case)
This is my sample deal document:
{
"_source": {
"id": 392745,
"category": [
{
"id": 17575,
"level": 2
},
{
"id": 17574,
"level": 1
},
{
"id": 17572,
"level": 0
}
],
"stats": {
"available_stock": 500
}
}
}
And this would be the query:
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
}
},
"aggs": {
"mainAggregation": {
"terms": {
"field": "deal.category.id",
"order": {
"available_stock": "desc"
},
"size": 3
},
"aggs": {
"available_stock": {
"sum": {
"field": "deal.stats.available_stock"
}
}
}
}
},
"size": 0
}
And my resulting aggregation, sadly including category 17572 with level 0.
{
"aggregations": {
"mainAggregation": {
"buckets": [
{
"key": 17572,
"doc_count": 30,
"available_stock": {
"value": 24000
}
},
{
"key": 17598,
"doc_count": 10,
"available_stock": {
"value": 12000
}
},
{
"key": 17602,
"doc_count": 8,
"available_stock": {
"value": 6000
}
}
]
}
}
}
P.S.: Currently on ElasticSearch 1.6
Update 1: Still stuck on the problem after various experiments with various combimation of subaggregations.
I have found this impossible to solve and decided to go with two separate queries.

elastic search : Aggregating the specific nested documents only

I want to aggregate the specific nested documents which satisfies the given query.
Let me explain it through an example. I have inserted two records in my index:
First document is,
{
"project": [
{
"subject": "maths",
"marks": 47
},
{
"subject": "computers",
"marks": 22
}
]
}
second document is,
{
"project": [
{
"subject": "maths",
"marks": 65
},
{
"subject": "networks",
"marks": 72
}
]
}
Which contains the subject along with the marks in each record. From that documents, I need to have an average of maths subject alone from the given documents.
The query I tried is:
{
"size": 0,
"aggs": {
"avg_marks": {
"avg": {
"field": "project.marks"
}
}
},
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "project.subject:maths",
"analyze_wildcard": true,
"default_field": "*"
}
}
]
}
}
}
Which is returning the result of aggregating all the marks average which is not required.
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"avg_marks": {
"value": 51.5
}
}
}
I just need an average of maths subject from the given documents, in which the expected result is 56.00
any help with the query or idea will be helpful.
Thanks in advance.
First you need in your mapping to specify that index have nested field like following:
PUT /nested-index {
"mappings": {
"document": {
"properties": {
"project": {
"type": "nested",
"properties": {
"subject": {
"type": "keyword"
},
"marks": {
"type": "long"
}
}
}
}
}
}
}
then you insert your docs:
PUT nested-index/document/1
{
"project": [
{
"subject": "maths",
"marks": 47
},
{
"subject": "computers",
"marks": 22
}
]
}
then insert second doc:
PUT nested-index/document/2
{
"project": [
{
"subject": "maths",
"marks": 65
},
{
"subject": "networks",
"marks": 72
}
]
}
and then you do aggregation but specify that you have nested structure like this:
GET nested-index/_search
{
"size": 0,
"aggs": {
"subjects": {
"nested": {
"path": "project"
},
"aggs": {
"subjects": {
"terms": {
"field": "project.subject",
"size": 10
},
"aggs": {
"average": {
"avg": {
"field": "project.marks"
}
}
}
}
}
}
}
}
and why your query is not working and why give that result is because when you have nested field and do average it sums all number from one array if in that array you have some keyword doesn't matter that you want to aggregate only by one subject.
So if you have those two docs because in both docs you have math subject avg will be calculated like this:
(47 + 22 + 65 + 72) / 4 = 51.5
if you want avg for networks it will return you (because in one document you have network but it will do avg over all values in array):
65 + 72 = 68.5
so you need to use nested structure in this case.
If you are interested just for one subject you can than do aggregation just for subject equal to something like this (subject equal to "maths"):
GET nested-index/_search
{
"size": 0,
"aggs": {
"project": {
"nested": {
"path": "project"
},
"aggs": {
"subjects": {
"filter": {
"term": {
"project.subject": "maths"
}
},
"aggs": {
"average": {
"avg": {
"field": "project.marks"
}
}
}
}
}
}
}
}

Elasticsearch: Aggregate distinct values in array

I am using Elasticsearch to store click traffic and each row includes topics of the page which has been visited. A typical row looks like:
{
"date": "2017-09-10T12:26:53.998Z",
"pageid": "10263779",
"loc_ll": [
-73.6487,
45.4671
],
"ua_type": "Computer",
"topics": [
"Trains",
"Planes",
"Electric Cars"
]
}
I want each topics to be a keyword so if I search for cars nothing will be returned. Only Electric Cars would return a result.
I also want to run a distinct query on all topics in all rows so I have a list of all topics used.
Doing this on a pageid would look like like the following, but I am unsure how to approach this for the topics array.
{
"aggs": {
"ids": {
"terms": {
"field": pageid,
"size": 10
}
}
}
}
Your approach to querying and getting the available terms looks fine. Probably you should check your mapping. If you get results for cars this looks as your mapping for topics is an analyzed string (e.g. type text instead of keyword). So please check your mapping for this field.
PUT keywordarray
{
"mappings": {
"item": {
"properties": {
"id": {
"type": "integer"
},
"topics": {
"type": "keyword"
}
}
}
}
}
With this sample data
POST keywordarray/item
{
"id": 123,
"topics": [
"first topic", "second topic", "another"
]
}
and this aggregation:
GET keywordarray/item/_search
{
"size": 0,
"aggs": {
"topics": {
"terms": {
"field": "topics"
}
}
}
}
will result in this:
"aggregations": {
"topics": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "another",
"doc_count": 1
},
{
"key": "first topic",
"doc_count": 1
},
{
"key": "second topic",
"doc_count": 1
}
]
}
}
It is very therapeutic asking on SO. Simply changing the mapping type to keyword allowed me to achieve what I needed.
A part of me thought that it would concatenate the array into a string. But it doesn't
{
"mappings": {
"view": {
"properties": {
"topics": {
"type": "keyword"
},...
}
}
}
}
and a search query like
{
"aggs": {
"ids": {
"terms": {
"field": pageid,
"size": 10
}
}
}
}
Will return a distinct list of all elements in a fields array.

Resources