elastic search : Aggregating the specific nested documents only - elasticsearch

I want to aggregate the specific nested documents which satisfies the given query.
Let me explain it through an example. I have inserted two records in my index:
First document is,
{
"project": [
{
"subject": "maths",
"marks": 47
},
{
"subject": "computers",
"marks": 22
}
]
}
second document is,
{
"project": [
{
"subject": "maths",
"marks": 65
},
{
"subject": "networks",
"marks": 72
}
]
}
Which contains the subject along with the marks in each record. From that documents, I need to have an average of maths subject alone from the given documents.
The query I tried is:
{
"size": 0,
"aggs": {
"avg_marks": {
"avg": {
"field": "project.marks"
}
}
},
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "project.subject:maths",
"analyze_wildcard": true,
"default_field": "*"
}
}
]
}
}
}
Which is returning the result of aggregating all the marks average which is not required.
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"avg_marks": {
"value": 51.5
}
}
}
I just need an average of maths subject from the given documents, in which the expected result is 56.00
any help with the query or idea will be helpful.
Thanks in advance.

First you need in your mapping to specify that index have nested field like following:
PUT /nested-index {
"mappings": {
"document": {
"properties": {
"project": {
"type": "nested",
"properties": {
"subject": {
"type": "keyword"
},
"marks": {
"type": "long"
}
}
}
}
}
}
}
then you insert your docs:
PUT nested-index/document/1
{
"project": [
{
"subject": "maths",
"marks": 47
},
{
"subject": "computers",
"marks": 22
}
]
}
then insert second doc:
PUT nested-index/document/2
{
"project": [
{
"subject": "maths",
"marks": 65
},
{
"subject": "networks",
"marks": 72
}
]
}
and then you do aggregation but specify that you have nested structure like this:
GET nested-index/_search
{
"size": 0,
"aggs": {
"subjects": {
"nested": {
"path": "project"
},
"aggs": {
"subjects": {
"terms": {
"field": "project.subject",
"size": 10
},
"aggs": {
"average": {
"avg": {
"field": "project.marks"
}
}
}
}
}
}
}
}
and why your query is not working and why give that result is because when you have nested field and do average it sums all number from one array if in that array you have some keyword doesn't matter that you want to aggregate only by one subject.
So if you have those two docs because in both docs you have math subject avg will be calculated like this:
(47 + 22 + 65 + 72) / 4 = 51.5
if you want avg for networks it will return you (because in one document you have network but it will do avg over all values in array):
65 + 72 = 68.5
so you need to use nested structure in this case.
If you are interested just for one subject you can than do aggregation just for subject equal to something like this (subject equal to "maths"):
GET nested-index/_search
{
"size": 0,
"aggs": {
"project": {
"nested": {
"path": "project"
},
"aggs": {
"subjects": {
"filter": {
"term": {
"project.subject": "maths"
}
},
"aggs": {
"average": {
"avg": {
"field": "project.marks"
}
}
}
}
}
}
}
}

Related

Stats Aggregation with Min Mode in ElasticSearch

I have the below mapping in ElasticSearch
{
"properties":{
"Costs":{
"type":"nested",
"properties":{
"price":{
"type":"integer"
}
}
}
}
}
So every document has an Array field Costs, which contains many elements and each element has price in it. I want to find the min and max price with the condition being - that from each array the element with the minimum price should be considered. So it is basically min/max among the minimum value of each array.
Lets say I have 2 documents with the Costs field as
Costs: [
{
"price": 100,
},
{
"price": 200,
}
]
and
Costs: [
{
"price": 300,
},
{
"price": 400,
}
]
So I need to find the stats
This is the query I am currently using
{
"costs_stats":{
"nested":{
"path":"Costs"
},
"aggs":{
"price_stats_new":{
"stats":{
"field":"Costs.price"
}
}
}
}
}
And it gives me this:
"min" : 100,
"max" : 400
But I need to find stats after taking minimum elements of each array for consideration.
So this is what i need:
"min" : 100,
"max" : 300
Like we have a "mode" option in sort, is there something similar in stats aggregation also, or any other way of achieving this, maybe using a script or something. Please suggest. I am really stuck here.
Let me know if anything is required
Update 1:
Query for finding min/max among minimums
{
"_source":false,
"timeout":"5s",
"from":0,
"size":0,
"aggs":{
"price_1":{
"terms":{
"field":"id"
},
"aggs":{
"price_2":{
"nested":{
"path":"Costs"
},
"aggs":{
"filtered":{
"aggs":{
"price_3":{
"min":{
"field":"Costs.price"
}
}
},
"filter":{
"bool":{
"filter":{
"range":{
"Costs.price":{
"gte":100
}
}
}
}
}
}
}
}
}
},
"minValue":{
"min_bucket":{
"buckets_path":"price_1>price_2>filtered>price_3"
}
}
}
}
Only few buckets are coming and hence the min/max is coming among those, which is not correct. Is there any size limit.
One way to achieve your use case is to add one more field id, in each document. With the help of id field terms aggregation can be performed, and so buckets will be dynamically built - one per unique value.
Then, we can apply min aggregation, which will return the minimum value among numeric values extracted from the aggregated documents.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"mappings": {
"properties": {
"Costs": {
"type": "nested"
}
}
}
}
Index Data:
{
"id":1,
"Costs": [
{
"price": 100
},
{
"price": 200
}
]
}
{
"id":2,
"Costs": [
{
"price": 300
},
{
"price": 400
}
]
}
Search Query:
{
"size": 0,
"aggs": {
"id_terms": {
"terms": {
"field": "id",
"size": 15 <-- note this
},
"aggs": {
"nested_entries": {
"nested": {
"path": "Costs"
},
"aggs": {
"min_position": {
"min": {
"field": "Costs.price"
}
}
}
}
}
}
}
}
Search Result:
"aggregations": {
"id_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1,
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 100.0
}
}
},
{
"key": 2,
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 300.0
}
}
}
]
}
Using stats aggregation also, it can be achieved (if you add one more field id that uniquely identifies your document)
{
"size": 0,
"aggs": {
"id_terms": {
"terms": {
"field": "id",
"size": 15 <-- note this
},
"aggs": {
"costs_stats": {
"nested": {
"path": "Costs"
},
"aggs": {
"price_stats_new": {
"stats": {
"field": "Costs.price"
}
}
}
}
}
}
}
}
Update 1:
To find the maximum value among those minimums (as seen in the above query), you can use max bucket aggregation
{
"size": 0,
"aggs": {
"id_terms": {
"terms": {
"field": "id",
"size": 15 <-- note this
},
"aggs": {
"nested_entries": {
"nested": {
"path": "Costs"
},
"aggs": {
"min_position": {
"min": {
"field": "Costs.price"
}
}
}
}
}
},
"maxValue": {
"max_bucket": {
"buckets_path": "id_terms>nested_entries>min_position"
}
}
}
}
Search Result:
"aggregations": {
"id_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1,
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 100.0
}
}
},
{
"key": 2,
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 300.0
}
}
}
]
},
"maxValue": {
"value": 300.0,
"keys": [
"2"
]
}
}

Multiple key aggregation in ElasticSearch

I am new to Elastic Search and was exploring aggregation query. The documents I have are in the format -
{"name":"A",
"class":"10th",
"subjects":{
"S1":92,
"S2":92,
"S3":92,
}
}
We have about 40k such documents in our ES with the Subjects varying from student to student. The query to the system can be to aggregate all subject-wise scores for a given class. We tried to create a bucket aggregation query as explained in this guide here, however, this generates a single bucket per document and in our understanding requires an explicit mention of every subject.
We want to system to generate subject wise aggregate for the data by executing a single aggregation query, the problem I face is that in our data the subjects could vary from student to student and we don't have a global list of subject keys.
We wrote the following script but this only works if we know all possible subjects.
GET student_data_v1_1/_search
{ "query" :
{"match" :
{ "class" : "' + query + '" }},
"aggs" : { "my_buckets" : { "terms" :
{ "field" : "subjects", "size":10000 },
"aggregations": {"the_avg":
{"avg": { "field": "subjects.value" }}} }},
"size" : 0 }'
but this query only works for the document structure, but does not work multiple subjects are defined where we may not know the key-pair -
{"name":"A",
"class":"10th",
"subjects":{
"value":93
}
}
An alternate form the document is present is that the subject is a list of dictionaries -
{"name":"A",
"class":"10th",
"subjects":[
{"S1":92},
{"S2":92},
{"S3":92},
]
}
Having an aggregation query to solve either of the 2 document formats would be helpful.
======EDITS======
After updating the document to hold weights for each subject -
{
class": "10th",
"subject": [
{
"name": "s1",
"marks": 90,
"weight":30
},
{
"name": "s2",
"marks": 80,
"weight":70
}
]}
I have updated the query to be -
{
"query": {
"match": {
"class": "10th"
}
},
"aggs": {
"subjects": {
"nested": {
"path": "scores"
},
"aggs": {
"subjects": {
"terms": {
"field": "subject.name"
},
"aggs" : { "weighted_grade": { "weighted_avg": { "value": { "field": "subjects.score" }, "weight": { "field": "subjects.weight" } } } }
}
}
}
}
},
"size": 0
}
but it throws the error-
{u'error': {u'col': 312,
u'line': 1,
u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]',
u'root_cause': [{u'col': 312,
u'line': 1,
u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]',
u'type': u'unknown_named_object_exception'}],
u'type': u'unknown_named_object_exception'},
u'status': 400}
To achieve the required result I would suggest you to keep your index mapping as follows:
{
"properties": {
"class": {
"type": "keyword"
},
"subject": {
"type": "nested",
"properties": {
"marks": {
"type": "integer"
},
"name": {
"type": "keyword"
}
}
}
}
}
In the mapping above I have created subject as nested type with two properties, name to hold subject name and marks to hold marks in the subject.
Sample doc:
{
"class": "10th",
"subject": [
{
"name": "s1",
"marks": 90
},
{
"name": "s2",
"marks": 80
}
]
}
Now you can use nested aggregation and multilevel aggregation (i.e. aggregation inside aggregation). I used nested aggregation with terms aggregation for subject.name to get bucket containing all the available subjects. Then to get avg for each subject we add a child aggregation of avg to the subjects aggregation as below:
{
"query": {
"match": {
"class": "10th"
}
},
"aggs": {
"subjects": {
"nested": {
"path": "subject"
},
"aggs": {
"subjects": {
"terms": {
"field": "subject.name"
},
"aggs": {
"avg_score": {
"avg": {
"field": "subject.marks"
}
}
}
}
}
}
},
"size": 0
}
NOTE: I have added "size" : 0 so that elastic doesn't return matching docs in the result. To include or exclude it depends totally on your use case.
Sample result:
{
"took": 25,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": [
]
},
"aggregations": {
"subjects": {
"doc_count": 6,
"subjects": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "s1",
"doc_count": 3,
"avg_score": {
"value": 80
}
},
{
"key": "s2",
"doc_count": 2,
"avg_score": {
"value": 75
}
},
{
"key": "s3",
"doc_count": 1,
"avg_score": {
"value": 80
}
}
]
}
}
}
}
As you can see the result contains buckets with key as subject name and avg_score.value as the avg of marks.
UPDATE to include weighted_avg:
{
"query": {
"match": {
"class": "10th"
}
},
"aggs": {
"subjects": {
"nested": {
"path": "subject"
},
"aggs": {
"subjects": {
"terms": {
"field": "subject.name"
},
"aggs": {
"avg_score": {
"avg": {
"field": "subject.marks"
}
},
"weighted_grade": {
"weighted_avg": {
"value": {
"field": "subject.marks"
},
"weight": {
"field": "subject.weight"
}
}
}
}
}
}
}
},
"size": 0
}

elasticsearh how to find the percentage of document that has less prices than a number

Let's say i want to allow the users to enter the name of the city and the price of a thing (anything)
I need to know the percentage of (things) in that city that has less value for a field than an entered value:
i can search for a city like this:
"query": {
"filtered": {
"query": {
"match": {
"city": "Paris"
}
}
}
},
but i don't know how to do the other requirements, could you help me please?
Suposedly percentile-ranks-aggregation was intended as a means to achieve this
Example:
post <index>/<type>/_search
{
"filter": {
"term": {
"city": "blore"
}
},
"aggs": {
"rank": {
"percentile_ranks": {
"values": [
31
],
"field": "price"
}
}
},
"size": 0
}
But when testing I found that it is buggy I believe is related to issue.
So the work around would be to calculate the percentage on client side once the document counts have been acquired using a query similar to as follows:
post _search
{
"filter": {
"term": {
"city": "blore"
}
},
"aggs": {
"less_price_filter": {
"filter": {
"range": {
"price": {
"lt": 60
}
}
}
}
},
"size": 0
}
Response
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": []
},
"aggregations": {
"less_price_filter": {
"doc_count": 2
}
}
}
The percentage can be calculated on client-side by dividing doc_count*100/total.

Aggregate only matched nested object values in ElasticSearch

I need to sum only the values on the nested objects that match the query. It looks like ElasticSearch determines the documents matching the query and then sums across all of the nested objects. From the below outline I want to search on nestedobjects.objtype="A" and get back the sum of objvalue only for matching nestedobjects, I want to get the value 4. is this possible? If so, how?
Here is the mapping
{
"myindex": {
"mappings": {
"mytype": {
"properties": {
"nestedobjects": {
"type": "nested",
"include_in_parent": true,
"properties": {
"objtype": {
"type": "string"
},
"objvalue": {
"type": "integer"
}
}
}
}
}
}
}
}
Here are my documents
PUT /myindex/mytype/1
{
"nestedobjects": [
{ "objtype": "A", "objvalue": 1 },
{ "objtype": "B", "objvalue": 2 }
]
}
PUT /myindex/mytype/2
{
"nestedobjects": [
{ "objtype": "A", "objvalue": 3 },
{ "objtype": "B", "objvalue": 3 }
]
}
Here is my query code.
POST allscriptshl7/_search?search_type=count
{
"query": {
"filtered": {
"query": {
"query_string": {
"query": "nestedobjects.objtype:A"
}
}
}
},
"aggregations": {
"my_agg": {
"sum": {
"field": "nestedobjects.objvalue"
}
}
}
}
Since both (outer) documents match the condition that one of their inner documents match the query, both outer documents are returned, and the aggregation is calculated against all of the inner documents belonging to those outer documents. Whew.
Anyway, this seems to do what you're wanting, I think, using filter aggregation:
POST /myindex/_search?search_type=count
{
"aggs": {
"nested_nestedobjects": {
"nested": {
"path": "nestedobjects"
},
"aggs": {
"filtered_nestedobjects": {
"filter": {
"term": {
"nestedobjects.objtype": "a"
}
},
"aggs": {
"my_agg": {
"sum": {
"field": "nestedobjects.objvalue"
}
}
}
}
}
}
}
}
...
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"nested_nestedobjects": {
"doc_count": 4,
"filtered_nestedobjects": {
"doc_count": 2,
"my_agg": {
"value": 4,
"value_as_string": "4.0"
}
}
}
}
}
Here is some code I used to test it:
http://sense.qbox.io/gist/c1494619ff1bd0394d61f3d5a16cb9dfc229113a
Very well-structured question, by the way.

Trying to extract a leaf field from Elasticsearch

I have an object in elasticsearch which resembles something like this:
{
"text": "something something something",
"entities": { "hashtags":["test","test123"]}
}
The problem is that not each document has the entities attribute set. So I want to write a query which:
must contain a keyword in the text field
must have the entities field
extracts the entities.hashtag field
I'm trying to extract a leaf field using following query, the problem is I still get documents which don't have an entities field.
For the second part of the question, I was wondering: How do I only extract the entities.hashtags field? I tried something like "fields": ["entities.hashtags"] but it didn't work.
{
"size": 2000,
"query": {
"filtered": {
"query": {
"match_all": {
}
},
"filter": {
"bool": {
"must": [{
"term": {
"text": "something"
}
},
{
"missing": {
"field": "entities",
"existence": true
}
}]
}
}
}
}
}
This seems to do what you want, if I'm understanding you correctly. A "term" filter on the "text" field and an "exists" filter on the "entities" field filters the docs, and a "terms" aggregation on "entities.hashtags" extracts the values. I'll just post the full example I used:
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1
}
}
PUT /test_index/doc/1
{
"text": "something something something",
"entities": { "hashtags": ["test","test123"] }
}
PUT /test_index/doc/2
{
"text": "another doc",
"entities": { "hashtags": ["testagain","testagain123"] }
}
PUT /test_index/doc/3
{
"text": "doc with no entities"
}
POST /test_index/_search
{
"size": 0,
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{ "term": { "text": "something" } },
{ "exists": { "field": "entities" } }
]
}
}
}
},
"aggs": {
"hashtags": {
"terms": {
"field": "entities.hashtags"
}
}
}
}
...
{
"took": 35,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0,
"hits": []
},
"aggregations": {
"hashtags": {
"buckets": [
{
"key": "test",
"doc_count": 1
},
{
"key": "test123",
"doc_count": 1
}
]
}
}
}

Resources