Elasticsearch 1.x - aggregate on object conditions

I want to aggregate on data that has inner objects. For example:
{
"_index": "product_index-en",
"_type": "elasticproductmodel",
"_id": "000001111",
"_score": 6.3316255,
"_source": {
"productId": "11111111111",
"productIdOnlyLetterAndDigit": "11111111111",
"productIdOnlyDigit": "11111111111",
"productNumber": "11111111111",
"name": "Glow Plug",
"nameOnlyLetterAndDigit": "glowplug",
"productImageLarge": "11111111111.jpg",
"itemGroupId": "11111",
"relatedProductIds": [],
"dataAreaCountries": [
"fra",
"pol",
"uk",
"sie",
"sve",
"atl",
"ita",
"hol",
"dk"
],
"oemItems": [
{
"manufactorName": "BERU",
"manufacType": "0"
},
{
"manufactorName": "LUCAS",
"manufacType": "0"
}
]
}
}
I need to be able to aggregate the oemItems.manufactorName values, but only where oemItems.manufacType is "0". I have tried a number of examples, such as the accepted answer to "Elastic Search Aggregate into buckets on conditions", but I just cannot seem to wrap my head around it.
I've tried the following, hoping it would aggregate on manufacType first and then on manufactorName for each type. It does aggregate on manufacType, and the hit counts look correct, but the manufactorName buckets are empty:
GET /product_index-en/_search
{
"size": 0,
"aggs": {
"baked_goods": {
"nested": {
"path": "oemItems"
},
"aggs": {
"test1": {
"terms": {
"field": "oemItems.manufacType",
"size": 500
},
"aggs": {
"test2": {
"terms": {
"field": "oemItems.manufactorName",
"size": 500
}
}
}
}
}
}
}
}
And the result:
{
"took": 27,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 471214,
"max_score": 0,
"hits": []
},
"aggregations": {
"baked_goods": {
"doc_count": 677246,
"test1": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "0",
"doc_count": 436557,
"test2": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": []
}
},
{
"key": "1",
"doc_count": 240689,
"test2": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": []
}
}
]
}
}
}
}
I have also tried adding a nested term filter so that only oemItems with manufacType 1 are considered, using the following query. However, it returns whole products whose oemItems include manufacType 1, so the oemItems within those products still contain manufacType 0 as well as 1. I don't see how aggregating on this response would return only the oemItems.manufactorName values where oemItems.manufacType is 0.
GET /product_index-en/_search
{
"query" : { "match_all" : {} },
"filter" : {
"nested" : {
"path" : "oemItems",
"filter" : {
"bool" : {
"must" : [
{
"term" : {"oemItems.manufacType" : "1"}
}
]
}
}
}
}
}

Good start so far. Just try it like this:
POST /product_index-en/_search
{
"size": 0,
"query": {
"nested": {
"path": "oemItems",
"query": {
"term": {
"oemItems.manufacType": "0"
}
}
}
},
"aggs": {
"baked_goods": {
"nested": {
"path": "oemItems"
},
"aggs": {
"test1": {
"terms": {
"field": "oemItems.manufactorName",
"size": 500
}
}
}
}
}
}
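Note that a nested query only selects which products match; an aggregation that follows it still sees every oemItems entry of those products. If the counts should cover only the entries whose manufacType is "0", one option is a filter aggregation placed inside the nested aggregation. The Python sketch below only builds such a request body as a dict. The helper name build_oem_aggs and the aggregation names only_type and names are made up for illustration, and the body has not been run against a 1.x cluster here.

```python
import json


def build_oem_aggs(manufac_type="0", size=500):
    """Build a search body that counts manufactorName only for nested
    oemItems entries whose manufacType equals manufac_type.

    Hypothetical helper: the aggregation names are illustrative.
    """
    return {
        "size": 0,
        "aggs": {
            "baked_goods": {
                # Step into the nested oemItems documents.
                "nested": {"path": "oemItems"},
                "aggs": {
                    "only_type": {
                        # Keep only the nested entries of the wanted type.
                        "filter": {
                            "term": {"oemItems.manufacType": manufac_type}
                        },
                        "aggs": {
                            "names": {
                                "terms": {
                                    "field": "oemItems.manufactorName",
                                    "size": size,
                                }
                            }
                        },
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent to /product_index-en/_search.
    print(json.dumps(build_oem_aggs(), indent=2))
```

Because the filter sits below the nested aggregation, it is applied to each oemItems entry rather than to the whole product.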

Related

Elasticsearch - how to query with a condition and list the required field without duplicates

I am new to ES. I am now able to filter on a field in my DB with the condition
alarm!=0
The code is here:
{
"size":1,
"query": {
"bool": {
"must_not": {
"term": {
"header.alarmStatus": 0
}
}
}
}
}
It shows around 4000 hits, which is what I want.
The response for that (size = 1) was as follows:
"hits": {
"total": {
"value": 3842,
"relation": "eq"
},
"max_score": 0.0,
"hits": [
{
"_index": "index123",
"_type": "meter",
"_id": "63iti3QBSliyJ__JFt6C",
"_score": 0.0,
"_source": {
"header": {
"meterId": 1245
},
"data": {
"seqNum": 72
}
}
}
]
}
My question is: how can I run the query with the condition "header.alarmStatus" != 0
and list all the meter IDs without duplicates, together with their counts?
Thanks,
Jeff
As far as I understand, you need to list every meterId once (without duplicates) for the query with the condition "header.alarmStatus" != 0. For this, you can use a terms aggregation with a cardinality aggregation as a sub-aggregation.
Index Data:
{
"header": {
"meterId": 1246,
"alarmStatus": 3
},
"data": {
"seqNum": 72
}
}
{
"header": {
"meterId": 1246,
"alarmStatus": 2
},
"data": {
"seqNum": 72
}
}
{
"header": {
"meterId": 1245,
"alarmStatus": 1
},
"data": {
"seqNum": 72
}
}
Search Query:
{
"query": {
"bool": {
"must_not": {
"term": {
"header.alarmStatus": 0
}
}
}
},
"aggs": {
"genres": {
"terms": {
"field": "header.meterId"
},
"aggs": {
"item_count": {
"cardinality": {
"field": "header.meterId"
}
}
}
}
}
}
Search Result:
"aggregations": {
"genres": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1246,
"doc_count": 2,
"item_count": {
"value": 1
}
},
{
"key": 1245,
"doc_count": 1,
"item_count": {
"value": 1
}
}
]
}
}
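In the result above, each bucket key is a distinct meterId and doc_count is the number of matching documents for that meter. As a rough cross-check, the plain-Python sketch below mirrors what the query computes on the three sample documents. It does not talk to Elasticsearch, and the function name count_alarming_meters is made up for illustration.

```python
from collections import Counter

# The three sample documents from "Index Data" above.
docs = [
    {"header": {"meterId": 1246, "alarmStatus": 3}, "data": {"seqNum": 72}},
    {"header": {"meterId": 1246, "alarmStatus": 2}, "data": {"seqNum": 72}},
    {"header": {"meterId": 1245, "alarmStatus": 1}, "data": {"seqNum": 72}},
]


def count_alarming_meters(documents):
    """Mimic must_not term alarmStatus=0 followed by terms on meterId."""
    return Counter(
        d["header"]["meterId"]
        for d in documents
        if d["header"]["alarmStatus"] != 0
    )


counts = count_alarming_meters(docs)
for meter_id, doc_count in counts.most_common():
    # One line per distinct meterId, like one bucket per key.
    print(meter_id, doc_count)
```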

Elasticsearch: Can I return only the cardinality of a buckets agg, without returning all the buckets?

Take the following query and result,
POST index/_search
{
"size": 0,
"aggs": {
"perDeviceAggregation": {
"terms": {
"field": "deviceID"
},
"aggs": {
"score_avg": {
"avg": {
"field": "device_score"
}
}
}
},
"count":{
"cardinality": {
"field": "deviceID"
}
}
}
}
result:
"aggregations": {
"aads": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "aa",
"doc_count": 3,
"score_avg": {
"value": 3.8
}
},
{
"key": "bb",
"doc_count": 1,
"score_avg": {
"value": 3.8
}
}
]
},
"count": {
"value": 2
}
}
That's great. But in my situation, I don't really care about the information in each bucket. I only want to know the number of buckets. Something like the following:
"aggregations": {
"aads": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"bucket_count": 2
}
}
Is this possible in Elasticsearch?
Edit:
The question above was simplified. You might wonder why I calculate an average (which forces me to use terms instead of cardinality) if I don't care about what's in the buckets. I use the average to do a range aggregation. My actual problem looks like the following:
POST index/_search
{
"size": 0,
"aggs" : {
"mos_over_time" : {
"range" : {
"field" : "device_score",
"ranges" : [
{ "from" : 0.0, "to" : 2.6 },
{ "from" : 2.6, "to" : 4.0 },
{ "from" : 4.0 }
]
},
"aggs": {
"perDeviceAggregation": {
"terms": {
"field": "deviceID"
},
"aggs": {
"score_avg": {
"avg": {
"field": "device_score"
}
}
}
},
"count":{
"cardinality": {
"field": "deviceID"
}
}
}
}
}
}
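If the buckets have to be requested anyway, one fallback is to count them on the client and discard the rest. The Python sketch below does that for the response shown earlier; the helper name bucket_count is made up, and the sample dict reuses the "aads" key from the pasted result. Keep in mind that a terms aggregation returns only its top buckets (10 by default), so this count is capped by the size setting, whereas the cardinality value is not capped but is approximate.

```python
def bucket_count(aggregations, agg_name):
    """Return how many buckets a bucket aggregation came back with."""
    return len(aggregations[agg_name]["buckets"])


# The "aggregations" part of the response pasted above.
sample = {
    "aads": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
            {"key": "aa", "doc_count": 3, "score_avg": {"value": 3.8}},
            {"key": "bb", "doc_count": 1, "score_avg": {"value": 3.8}},
        ],
    },
    "count": {"value": 2},
}

print(bucket_count(sample, "aads"))
```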

Elasticsearch: Querying nested objects

Dear Elasticsearch experts,
I have a problem querying nested objects. Let's use the following simplified mapping:
{
"mappings" : {
"_doc" : {
"properties" : {
"companies" : {
"type": "nested",
"properties" : {
"company_id": { "type": "long" },
"name": { "type": "text" }
}
},
"title": { "type": "text" }
}
}
}
}
And put some documents in the index:
PUT my_index/_doc/1
{
"title" : "CPU release",
"companies" : [
{ "company_id" : 1, "name" : "AMD" },
{ "company_id" : 2, "name" : "Intel" }
]
}
PUT my_index/_doc/2
{
"title" : "GPU release 2018-01-10",
"companies" : [
{ "company_id" : 1, "name" : "AMD" },
{ "company_id" : 3, "name" : "Nvidia" }
]
}
PUT my_index/_doc/3
{
"title" : "GPU release 2018-03-01",
"companies" : [
{ "company_id" : 3, "name" : "Nvidia" }
]
}
PUT my_index/_doc/4
{
"title" : "Chipset release",
"companies" : [
{ "company_id" : 2, "name" : "Intel" }
]
}
Now I want to execute queries like this:
{
"query": {
"bool": {
"must": [
{ "match": { "title": "GPU" } },
{ "nested": {
"path": "companies",
"query": {
"bool": {
"must": [
{ "match": { "companies.name": "AMD" } }
]
}
},
"inner_hits" : {}
}
}
]
}
}
}
As a result, I want to get the matching companies with the number of matching documents. So the above query should give me:
[
{ "company_id" : 1, "name" : "AMD", "matched_documents": 1 }
]
The following query:
{
"query": {
"bool": {
"must": [
{ "match": { "title": "GPU" } },
{ "nested": {
"path": "companies",
"query": { "match_all": {} },
"inner_hits" : {}
}
}
]
}
}
}
should give me all companies assigned to a document whose title contains "GPU", with the number of matching documents:
[
{ "company_id" : 1, "name" : "AMD", "matched_documents": 1 },
{ "company_id" : 3, "name" : "Nvidia", "matched_documents": 2 }
]
Is there a way to achieve this result with good performance? I'm explicitly not interested in the matching documents, only in the number of matched documents and the nested objects.
Thanks for your help.
What you need to do in terms of Elasticsearch is:
filter "parent" documents on the desired criteria (like having GPU in the title, or also mentioning Nvidia in the companies list);
group "nested" documents by a certain criterion, a bucket (e.g. company_id);
count how many "nested" documents there are in each bucket.
Each nested object in the array is indexed as a separate hidden document, which complicates life a bit. Let's see how to aggregate on them.
So how to aggregate and count the nested documents?
You can achieve this with a combination of a nested, terms and top_hits aggregation:
POST my_index/_doc/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"title": "GPU"
}
},
{
"nested": {
"path": "companies",
"query": {
"match_all": {}
}
}
}
]
}
},
"aggs": {
"Extract nested": {
"nested": {
"path": "companies"
},
"aggs": {
"By company id": {
"terms": {
"field": "companies.company_id"
},
"aggs": {
"Examples of such company_id": {
"top_hits": {
"size": 1
}
}
}
}
}
}
}
}
This will give the following output:
{
...
"hits": { ... },
"aggregations": {
"Extract nested": {
"doc_count": 3, <== how many "nested" documents there were
"By company id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3, <== this bucket's key: "company_id": 3
"doc_count": 2, <== how many "nested" documents there were with such company_id?
"Examples of such company_id": {
"hits": {
"total": 2,
"max_score": 1.5897496,
"hits": [ <== an example, "top hit" for such company_id
{
"_nested": {
"field": "companies",
"offset": 1
},
"_score": 1.5897496,
"_source": {
"company_id": 3,
"name": "Nvidia"
}
}
]
}
}
},
{
"key": 1,
"doc_count": 1,
"Examples of such company_id": {
"hits": {
"total": 1,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 0
},
"_score": 1.5897496,
"_source": {
"company_id": 1,
"name": "AMD"
}
}
]
}
}
}
]
}
}
}
}
Notice that for Nvidia we have "doc_count": 2.
But what if we want to count the number of "parent" objects that have Nvidia vs. Intel?
What if we want to count parent objects based on a nested bucket?
It can be achieved with the reverse_nested aggregation.
We need to change our query just a little bit:
POST my_index/_doc/_search
{
"query": { ... },
"aggs": {
"Extract nested": {
"nested": {
"path": "companies"
},
"aggs": {
"By company id": {
"terms": {
"field": "companies.company_id"
},
"aggs": {
"Examples of such company_id": {
"top_hits": {
"size": 1
}
},
"original doc count": { <== we ask ES to count how many parent docs there are
"reverse_nested": {}
}
}
}
}
}
}
}
The result will look like this:
{
...
"hits": { ... },
"aggregations": {
"Extract nested": {
"doc_count": 3,
"By company id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3,
"doc_count": 2,
"original doc count": {
"doc_count": 2 <== how many "parent" documents have such company_id
},
"Examples of such company_id": {
"hits": {
"total": 2,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 1
},
"_score": 1.5897496,
"_source": {
"company_id": 3,
"name": "Nvidia"
}
}
]
}
}
},
{
"key": 1,
"doc_count": 1,
"original doc count": {
"doc_count": 1
},
"Examples of such company_id": {
"hits": {
"total": 1,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 0
},
"_score": 1.5897496,
"_source": {
"company_id": 1,
"name": "AMD"
}
}
]
}
}
}
]
}
}
}
}
How can I spot the difference?
To make the difference evident, let's change the data a bit and add another Nvidia item in the document list:
PUT my_index/_doc/2
{
"title" : "GPU release 2018-01-10",
"companies" : [
{ "company_id" : 1, "name" : "AMD" },
{ "company_id" : 3, "name" : "Nvidia" },
{ "company_id" : 3, "name" : "Nvidia" }
]
}
The last query (the one with reverse_nested) will give us the following:
"By company id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3,
"doc_count": 3, <== 3 "nested" documents with Nvidia
"original doc count": {
"doc_count": 2 <== but only 2 "parent" documents
},
"Examples of such company_id": {
"hits": {
"total": 3,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 2
},
"_score": 1.5897496,
"_source": {
"company_id": 3,
"name": "Nvidia"
}
}
]
}
}
},
As you can see, this is a subtle difference that is hard to grasp, but it changes the semantics completely.
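To see the two counts side by side without a cluster, here is a plain-Python sketch over the four documents (with the extra Nvidia entry in document 2). It is only an illustration of the counting semantics: the function name count_companies is made up, and the exact word test on the title is a crude stand-in for the match query.

```python
from collections import Counter

AMD = {"company_id": 1, "name": "AMD"}
INTEL = {"company_id": 2, "name": "Intel"}
NVIDIA = {"company_id": 3, "name": "Nvidia"}

docs = [
    {"title": "CPU release", "companies": [AMD, INTEL]},
    {"title": "GPU release 2018-01-10", "companies": [AMD, NVIDIA, NVIDIA]},
    {"title": "GPU release 2018-03-01", "companies": [NVIDIA]},
    {"title": "Chipset release", "companies": [INTEL]},
]


def count_companies(documents, word):
    """Return (nested counts, parent counts) per company_id."""
    nested = Counter()   # like doc_count of the terms bucket
    parents = Counter()  # like doc_count of reverse_nested
    for doc in documents:
        if word not in doc["title"].split():
            continue  # parent document filtered out by the query
        ids = [c["company_id"] for c in doc["companies"]]
        nested.update(ids)        # every nested object counts
        parents.update(set(ids))  # each parent counts once per company
    return nested, parents


nested, parents = count_companies(docs, "GPU")
print("nested:", dict(nested))
print("parents:", dict(parents))
```

For company_id 3 the nested count is 3 while the parent count is 2, which matches the two doc_count values in the output above.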
What about performance?
While the performance of nested queries and aggregations should be sufficient in most cases, it does come at a certain cost. It is therefore recommended to avoid nested or parent-child types when tuning for search speed.
In Elasticsearch the best performance is often achieved through denormalization, although there is no single recipe and you should select the data model depending on your needs.
Hope this clarifies this nested thing for you a bit!

Elasticsearch Histogram of visits

I'm quite new to Elasticsearch and I have failed to build a histogram based on ranges of visits. I am not even sure that it's possible to create this kind of chart with a single query in Elasticsearch, but I have the feeling that it could be possible with a pipeline aggregation or maybe a scripted aggregation.
Here is a test dataset with which I'm working:
PUT /test_histo
{ "settings": { "number_of_shards": 1 }}
PUT /test_histo/_mapping/visit
{
"properties": {
"user": {"type": "string" },
"datevisit": {"type": "date"},
"page": {"type": "string"}
}
}
POST test_histo/visit/_bulk
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"John","page":"home.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"Jean","page":"productXX.hmtl","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"Robert","page":"home.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"Mary","page":"home.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"Mary","page":"media_center.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"John","page":"home.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"John","page":"media_center.html","datevisit":"2015-11-26"}
If we consider the ranges [1,2[, [2,3[, [3, inf.[ for the number of visits per user, the expected result should be:
[1,2[ = 2
[2,3[ = 1
[3, inf.[ = 1
So far, all my efforts to build a histogram of customer visit frequency have been unsuccessful. I would appreciate any tips, tricks or ideas for solving this problem.
There are two ways you can do it.
The first is to do it in Elasticsearch, which requires a Scripted Metric Aggregation. You can read more about it here.
Your query would look like this:
{
"size": 0,
"aggs": {
"visitors_over_time": {
"date_histogram": {
"field": "datevisit",
"interval": "week"
},
"aggs": {
"no_of_visits": {
"scripted_metric": {
"init_script": "_agg['values'] = new java.util.HashMap();",
"map_script": "if (_agg.values[doc['user'].value]==null) {_agg.values[doc['user'].value]=1} else {_agg.values[doc['user'].value]+=1;}",
"combine_script": "someHashMap = new java.util.HashMap();for(x in _agg.values.keySet()) {value=_agg.values[x];if(value<3){key='[' + value +',' + (value + 1) + '[';}else{key='[3,inf[';}; if(someHashMap[key]==null){someHashMap[key] = 1}else{someHashMap[key] += 1}}; return someHashMap;"
}
}
}
}
}
}
Here you can change the period in the date_histogram object by setting the interval field to values like day, week, or month.
Your response would look like this:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 7,
"max_score": 0,
"hits": []
},
"aggregations": {
"visitors_over_time": {
"buckets": [
{
"key_as_string": "2015-11-23T00:00:00.000Z",
"key": 1448236800000,
"doc_count": 7,
"no_of_visits": {
"value": [
{
"[2,3[": 1,
"[3,inf[": 1,
"[1,2[": 2
}
]
}
}
]
}
}
}
The second method is to do the work of the scripted_metric on the client side, using the result of a Terms Aggregation. You can read more about it here.
Your query will look like this:
GET test_histo/visit/_search
{
"size": 0,
"aggs": {
"visitors_over_time": {
"date_histogram": {
"field": "datevisit",
"interval": "week"
},
"aggs": {
"no_of_visits": {
"terms": {
"field": "user",
"size": 10
}
}
}
}
}
}
and the response will be
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 7,
"max_score": 0,
"hits": []
},
"aggregations": {
"visitors_over_time": {
"buckets": [
{
"key_as_string": "2015-11-23T00:00:00.000Z",
"key": 1448236800000,
"doc_count": 7,
"no_of_visits": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "john",
"doc_count": 3
},
{
"key": "mary",
"doc_count": 2
},
{
"key": "jean",
"doc_count": 1
},
{
"key": "robert",
"doc_count": 1
}
]
}
}
]
}
}
}
From this response you can then count, for each period, how many users fall into each doc_count range.
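The client-side step can be sketched in a few lines of Python. The function name visit_histogram is made up for illustration; the input is the inner "buckets" list of one period from the response above. Note that the query asks for "size": 10, so with more than ten users per period the size would need to be raised for the histogram to be complete.

```python
from collections import Counter


def visit_histogram(user_buckets):
    """Count users per visit-frequency range from terms buckets."""
    histogram = Counter()
    for bucket in user_buckets:
        visits = bucket["doc_count"]
        if visits < 3:
            # Ranges [1,2[ and [2,3[ each hold a single visit count.
            label = "[%d,%d[" % (visits, visits + 1)
        else:
            # Everything from 3 visits upwards shares one range.
            label = "[3,inf["
        histogram[label] += 1
    return dict(histogram)


# The per-user buckets of the single week in the response above.
buckets = [
    {"key": "john", "doc_count": 3},
    {"key": "mary", "doc_count": 2},
    {"key": "jean", "doc_count": 1},
    {"key": "robert", "doc_count": 1},
]

print(visit_histogram(buckets))
```

For the sample week this gives 2 users in [1,2[, 1 in [2,3[ and 1 in [3,inf[, which is the result the question asked for.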
Have a look at:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html
If you want to show it in a ready-made UI, use Kibana.
A query like this:
GET _search
{
"query": {
"match_all": {}
},
"aggs" : {
"visits" : {
"date_histogram" : {
"field" : "datevisit",
"interval" : "month"
}
}
}
}
Should give you a histogram. I don't have Elasticsearch here at the moment, so there might be some fat-fingered typos.
Then you could add query terms to show the histogram only for a specific page, or you could add an outer aggregation bucket that aggregates per page or user.
Something like this:
GET _search
{
"query": {
"match_all": {}
},
"aggs" : {
"users" : {
"terms" : {
"field" : "user"
},
"aggs" : {
"visits" : {
"date_histogram" : {
"field" : "datevisit",
"interval" : "month"
}
}
}
}
}
}
Have a look at this solution:
{
"query": {
"match_all": {}
},
"aggs": {
"periods": {
"filters": {
"filters": {
"1-2": {
"range": {
"datevisit": {
"gte": "2015-11-25",
"lt": "2015-11-26"
}
}
},
"2-3": {
"range": {
"datevisit": {
"gte": "2015-11-26",
"lt": "2015-11-27"
}
}
},
"3-": {
"range": {
"datevisit": {
"gte": "2015-11-27"
}
}
}
}
},
"aggs": {
"users": {
"terms": {"field": "user"}
}
}
}
}
}
Step by step:
Filters aggregation: defines the ranges for the next aggregation. In this case we define 3 periods using date range filters.
Users sub-aggregation: returns one result per filter you defined. So, in this case, you'll get 3 sets of user buckets, one per date range.
You'll get a result like this:
{
...
"aggregations" : {
"periods" : {
"buckets" : {
"1-2" : {
"users" : {
"buckets" : [
{"key" : XXX,"doc_count" : NNN},
{"key" : YYY,"doc_count" : NNN},
]
}
},
"2-3" : {
"users" : {
"buckets" : [
{"key" : XXX1,"doc_count" : NNN1},
{"key" : YYY1,"doc_count" : NNN1},
]
}
},
"3-" : {
"users" : {
"buckets" : [
{"key" : XXX2,"doc_count" : NNN2},
{"key" : YYY2,"doc_count" : NNN2},
]
}
},
}
}
}
}
Try it and tell me if it works.

Limit aggregations to list of values

Can I limit aggregations to return only a specific list of values? I have something like this:
{ "aggs" : {
"province" : {
"terms" : {
"field" : "province"
}
}
},
"query": {
"bool": {
//my query..
But let's say I know the list of provinces for which I want counts ({'province1', 'province2', 'province3'}). Is it possible to restrict the returned list of provinces without influencing my query results?
I want to get:
//list of hits..
//
"aggregations": {
"province": {
"buckets": [
{
"key": "province1",
"doc_count": 200
},
{
"key": "province2",
"doc_count": 162
},
{
"key": "province3",
"doc_count": 162
}
// even if there is more possible provinces
// I don't want to see them
Sure, just use the include parameter of the terms aggregation.
Here's an example. Let's say I have visit stats for a bunch of different IP addresses, but I only want to get document counts for two of them. I could do this:
POST /test_index/_search?search_type=count
{
"aggregations": {
"ip": {
"terms": {
"field": "ip",
"size": 10,
"include": [
"146.233.189.126",
"193.33.153.89"
]
}
}
}
}
and get back something like:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 7,
"max_score": 0,
"hits": []
},
"aggregations": {
"ip": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "146.233.189.126",
"doc_count": 3
},
{
"key": "193.33.153.89",
"doc_count": 3
}
]
}
}
}
Here is some code I used to play around with it:
http://sense.qbox.io/gist/68697646ef7afc9f0375995b6f84181a7ac4cba9
So your example might look like:
{
"aggs": {
"province": {
"terms": {
"field": "province",
"include": [
"province1",
"province2",
"province3"
]
}
}
}
}

Resources